Setting up a local DBpedia mirror with Virtuoso

Newer version available: Setting up a local DBpedia 2014 mirror with Virtuoso 7.1.0

So you’re the guy who is allowed to set up a local DBpedia mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my hours of trial and error ;) If anything goes wrong, feel free to leave me a comment below.

Versions

There is an updated version of this post for DBpedia 3.7: Setting up a local DBpedia 3.7 mirror with Virtuoso 6.1.5+.
Originally this guide was written for DBpedia 3.5.1 and Virtuoso 6.1.2. Later it was updated to DBpedia 3.6 and Virtuoso 6.1.3. The overall process and the main pitfalls are the same, so if you for any reason want to use DBpedia 3.5.1 just make sure to use the correct dumps.

Prerequisite

Before starting you basically need one thing: a powerful server / vm with a lot of ram (we used 8 or 4 cores with a total of 32 GB ram, but as you can imagine this strongly depends on what you want to do with your mirror). I’d suggest setting it up with Ubuntu 10.04 LTS but that’s a matter of choice. Anyhow, you’ll need root access.

A few more words:
The following will largely be bash syntax, sometimes we switch to sql, so comments start with # or --.
Should a command take longer than 5 minutes the approx. times are given.
Overall time for this setup (en and de DBpedia 3.5.1) was 7 hours, (en DBpedia 3.6 only took 5 hours) once I had figured out what to do and what not to do. This was tested on Ubuntu 9.10 and 10.04 (both 64-bit) with Virtuoso 06.01.3127.

Let’s go

Download and install virtuoso

Go and download virtuoso opensource: http://sourceforge.net/projects/virtuoso/
Put the file in your home dir on the server, then do the following.

sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev libmagickcore-dev libmagickwand-dev dnsutils
cd ~
tar -xvzf virtuoso-*
cd virtuoso-opensource-*/ # e.g. virtuoso-opensource-6.1.3, depending on the version you downloaded
CFLAGS="-O2 -m64"
export CFLAGS
./configure --with-layout=debian # NOTICE: this will _not_ install into /usr/local but into /usr (so might clash with packages by your distribution if you install "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
make -j5

This will take about 1 hour. In parallel, you might want to start with downloading the DBpedia files (next section) and come back.

sudo make install
sudo mkdir -p /usr/local/data/dbpedia/3.6/en

Now change the following values in /var/lib/virtuoso/db/virtuoso.ini. The performance tuning settings follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

[Parameters]
DirsAllowed              = ., /usr/share/virtuoso/vad, /usr/local/data/dbpedia
# IMPORTANT: for performance also set this (still in the [Parameters] section)
NumberOfBuffers          = 2400000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of memory
# it's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
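
The arithmetic from the comments above can be scripted if you want to redo it for a different amount of RAM. This is only a sketch: the function names are made up for illustration (they are not part of Virtuoso), and it simply applies the 65 % rule, the approx. 8700 bytes per buffer and the 1/4 rule mentioned above.

```shell
# Hypothetical helpers, not part of Virtuoso.
# number_of_buffers <ram in GB>: 65 % of RAM divided by ~8700 bytes per buffer.
number_of_buffers() {
  local ram_gb=$1
  echo $(( ram_gb * 1000000000 * 65 / 100 / 8700 ))
}

# max_checkpoint_remap <ram in GB>: a quarter of NumberOfBuffers.
max_checkpoint_remap() {
  echo $(( $(number_of_buffers "$1") / 4 ))
}
```

For example, number_of_buffers 32 prints 2390804, the same value as in the comment above.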

The next step installs an init-script (autostarts) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper

Downloading the DBpedia dump files

We decided that we only needed the German and English files in NT format (this was for DBpedia 3.5.1; with DBpedia 3.6 we only needed and installed English). If you need something different, adjust the commands accordingly. For example, if you decide to download the all-languages tar, make sure to exclude the NQ files from the later importing steps. One simple way to do this is to move everything you don’t want to import out of the directory. Also don’t forget to import the dbpedia_3.*.owl file (last step in the script below)!
Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server importing was noticeably slower from extracted files than from gzipped ones (not to mention the vast amount of disk space wasted on the extracted files). File access becomes the bottleneck while 8 cores sit idle. This is why I decided to repack all the files from bz2 to gz. As you can see, I do the en and de repacking in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to repack in parallel with downloading. The repacking process below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/dbpedia/3.6/en
cd /usr/local/data/dbpedia/3.6/en
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.6/en/

# if you want to save space do this:
for i in *.bz2 ; do bzcat "$i" | gzip --fast > "${i%.bz2}.gz" && rm "$i" ; done &
# else do:
#bunzip2 *&

mkdir ../de && cd ../de
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.6/de/
# if you want to save space do this:
for i in *.bz2 ; do bzcat "$i" | gzip --fast > "${i%.bz2}.gz" && rm "$i" ; done &
# else do:
#bunzip2 *

cd ..
wget http://downloads.dbpedia.org/3.6/dbpedia_3.6.owl

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
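
If you have more cores than languages, the repacking loop can be spread over several processes. A minimal sketch, assuming GNU find and xargs and a bash binary are available; the function name repack_parallel and the default of 4 jobs are just examples, not something from the DBpedia or Virtuoso docs:

```shell
# Hypothetical helper: repack all *.bz2 files in one directory to *.gz,
# running up to <jobs> conversions at the same time (default: 4).
repack_parallel() {
  local dir=$1
  local njobs=${2:-4}
  find "$dir" -maxdepth 1 -name '*.bz2' -print0 |
    xargs -0 -r -n 1 -P "$njobs" bash -c '
      set -o pipefail
      # only delete the .bz2 if both bzcat and gzip succeeded
      bzcat "$1" | gzip --fast > "${1%.bz2}.gz" && rm "$1"
    ' _
}
```

Called as repack_parallel /usr/local/data/dbpedia/3.6/en 8 it would do the same work as the loop above, eight files at a time.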

In the meantime, let’s start with the DBpedia bulk loading tutorial. We need to set up a few things before we can start, so carry on while the download and repacking are still running, provided Virtuoso is running (otherwise go back to the previous section).

Installing the bulk loader scripts

As a reference only: VirtBulkRDFLoaderExampleDbpedia
A few words about that tutorial: I found it to lack much information, so for this tutorial: USE WHAT’S WRITTEN HERE!
We’ve already set the DirsAllowed param in the .ini file and downloaded the files (or go back and continue with the installation of Virtuoso now). Now let’s execute the bulk loading script taken from:

http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript.

Copy the contents into a plain text file called VirtBulkRDFLoaderScript.vsql in your home dir.

Note: in the following I will assume that your Virtuoso isql command is called isql. If you don’t have such a command, it might be called isql-vt.

cd
# this will only install all the needed db procedures doing the work later on.
isql localhost dba dba VirtBulkRDFLoaderScript.vsql

Cleaning problematic parts out of the DBpedia dump

OK, now a word about the DBpedia dumps:
I had problems importing a couple of files in the later steps, in all DBpedia versions I tried, because some URIs are much longer than 1024 characters:

For DBpedia 3.5.1:

  • external_links_en.nt.gz contained 104 such URIs
  • page_links_en.nt.gz contained >5 such URIs (stopped the long counting process after these).

For DBpedia 3.6:

  • external_links_en.nt.gz
  • infobox_properties_en.nt.gz
  • page_links_en.nt.gz

To see these problems yourself do the following:

cd /usr/local/data/dbpedia/3.6/en
zcat external_links_en.nt.gz | cat -n | grep -E '^[[:space:]]*[0-9]+[[:space:]]*<.+> <.+> <.{1025,}> .$'

To fix this (by excluding these lines) do (takes about 30 minutes for page_links_en):

cd /usr/local/data/dbpedia/3.6/en
for i in external_links_en.nt.gz infobox_properties_en.nt.gz page_links_en.nt.gz ; do
  echo -n "cleaning $i..."
  zcat "$i" | grep -v -E '^<.+> <.+> <.{1025,}> .$' | gzip --fast > "${i%.nt.gz}_cleaned.nt.gz" &&
  mv "${i%.nt.gz}_cleaned.nt.gz" "$i"
  echo "done."
done
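
Before overwriting the originals it can be reassuring to see how many lines the filter would actually drop. A small sketch using the same pattern as above; count_long_uri_lines is a made-up name for illustration:

```shell
# Hypothetical helper: print how many lines of a gzipped N-Triples file
# have an object URI longer than 1024 characters (same pattern as the filter).
count_long_uri_lines() {
  # grep -c exits non-zero when the count is 0, hence the "|| true"
  gzip -dc "$1" | grep -c -E '^<.+> <.+> <.{1025,}> .$' || true
}

# Example use (run inside the dump directory):
#   for f in external_links_en.nt.gz page_links_en.nt.gz ; do
#     echo "$f: $(count_long_uri_lines "$f")"
#   done
```

After the cleaning step the count for each file should be 0.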

Importing DBpedia dumps into virtuoso

Now, AFTER the re-/unpacking and cleaning of the files, we will register all files in the dbpedia dir (recursively, with ld_dir_all) to be added to the dbpedia graph. As mentioned above: if you use this method, make sure that only files you really want to import reside in the given subtree.
If you only want one directory’s files to be added (non recursive) use ld_dir.
If you manually want to add some files, use ld_add.
See the VirtBulkRDFLoaderScript.vsql file for args to pass.

isql # enter virtuoso sql mode
-- we are in sql mode now
ld_dir_all('/usr/local/data/dbpedia/3.6', '*.*', 'http://dbpedia.org');
-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;
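
If LOAD_LIST shows files you did not expect, a quick look at the directory tree helps to find them, because the '*.*' mask registers everything below the given directory. A sketch, assuming you only want plain or gzipped NT files plus the ontology file; list_unexpected_files is a made-up helper name:

```shell
# Hypothetical helper: list files below a directory that are neither
# *.nt, *.nt.gz nor *.owl (e.g. leftover *.bz2 or NQ files).
list_unexpected_files() {
  find "$1" -type f ! -name '*.nt' ! -name '*.nt.gz' ! -name '*.owl'
}

# Example use:
#   list_unexpected_files /usr/local/data/dbpedia/3.6
```

Move whatever it prints out of the tree, clear LOAD_LIST and register again.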

OK, now comes the fun (and long) part: about 5 hours… We registered the files to be added, now let’s finally start the process. Fire up screen (see the first comment below) if you didn’t already.

isql
rdf_loader_run();
-- will take approx. 4 hours with 32 GB RAM and 4 CPUs if set as above!
-- DO NOT USE THE DB BESIDES THESE COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
--  when I visited the Virtuoso Conductor during the first run...)
-- you can watch the progress from another isql session with select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit WORK;
checkpoint;
EXIT;
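
To watch the progress without opening an interactive session, the LOAD_LIST query can also be run from a second shell. This is a sketch under two assumptions: that your isql accepts a statement via an exec="..." argument, and that LOAD_LIST has the ll_state column created by the bulk loader script (as far as I know 0 = waiting, 1 = in progress, 2 = done). The function name and the ISQL variable are made up for illustration.

```shell
# Name of the Virtuoso isql binary; it may be isql-vt on some installs.
ISQL=${ISQL:-isql}

# Hypothetical helper: show how many files are in each loading state.
load_progress() {
  "$ISQL" localhost dba dba \
    exec="SELECT ll_state, COUNT(*) FROM DB.DBA.LOAD_LIST GROUP BY ll_state;"
}

# Example use: run it every minute in a second terminal
#   while sleep 60 ; do load_progress ; done
```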

Testing your local mirror

After this:
Take a look at the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You could still use the imported data, but it will be incomplete. Any error aborts the loading of the corresponding file and continues with the next one, so you only get the part of that file up to the place where the error occurred. (Should you find errors you can’t fix in the way I did above, please leave a comment.)
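
The loader errors in the log have the form "PL LOG: File <path> error <message>" (see the log lines quoted in the comments below), so they are easy to pull out. A small sketch; list_failed_loads is a made-up name:

```shell
# Hypothetical helper: print "<file>: <error message>" for every file
# the bulk loader gave up on, based on the "PL LOG: File ... error ..." lines.
list_failed_loads() {
  grep -E 'PL LOG: File .+ error ' "$1" |
    sed -E 's/^.*PL LOG: File ([^ ]+) error (.*)$/\1: \2/'
}

# Example use:
#   list_failed_loads /var/lib/virtuoso/db/virtuoso.log
```

No output means the log contains no loader errors of this form.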

Go to the sparql-endpoint of your server http://your-server:8890/sparql (or in isql prefix with: SPARQL)

SELECT COUNT(*) WHERE { ?s ?p ?o } ;

This might take about 10 minutes and then returns 258867871 (257134114 for en DBpedia 3.6). Subsequent queries are a lot faster. If you find another (preferably automatic) way to warm up the caches, please leave us a note.
I also like this query showing all the graphs and how many triples are in them:

SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
-- g                                                    callret-1
-- VARCHAR                                              VARCHAR
-- ______________________________________________________________
--
-- http://dbpedia.org                                   257130074
-- http://localhost:8890/DAV                                 2067
-- http://www.openlinksw.com/schemas/virtrdf#                1813
-- http://www.w3.org/2002/07/owl#                             160
--
-- 4 Rows. -- 214173 msec.

Congratulations, you just imported over 250 million triples.

Final polishing

Isn’t this a good moment to back up the db? If you think so as well, see the next section. I made a backup twice: just after all the triples were imported, and again after the following:
You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890

login: dba
pw: dba

Go to System Admin / Packages. Install the rdf_mappers packages (takes about 5 minutes).
Download the dbpedia vad package from http://download.openlinksw.com/packages/6.1/virtuoso/, then upload & install it to Virtuoso (below the package list).
I used rdf_mappers: 1.24.35/ 2011-06-13 and dbpedia: 1.3.01/ 2010-07-10.

After this, there should be a couple more graphs in your DB:

SQL> sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
-- g                                                           callret-1
-- VARCHAR                                                     VARCHAR
-- ______________________________________________________________________
--
-- http://dbpedia.org                                          257130074
-- http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/  256057
-- http://localhost:8890/DAV                                   3675
-- http://www.openlinksw.com/schemas/virtrdf#                  1813
-- http://open.vocab.org/terms                                 1480
-- http://purl.org/ontology/bibo/                              1226
-- http://purl.org/goodrelations/v1                            937
-- http://purl.org/dc/terms/                                   857
-- http://www.openlinksw.com/schemas/google-base               691
-- http://xmlns.com/foaf/0.1/                                  557
-- http://rdfs.org/sioc/ns#                                    553
-- http://www.openlinksw.com/schemas/evri                      482
-- http://bblfish.net/work/atom-owl/2006-06-06/                386
-- http://scot-project.org/scot/ns#                            332
-- http://www.openlinksw.com/schemas/zillow                    311
-- http://www.w3.org/2004/02/skos/core                         252
-- http://www.openlinksw.com/schemas/cnet                      225
-- http://www.openlinksw.com/schemas/tesco                     183
-- http://www.openlinksw.com/schemas/bestbuy                   172
-- http://www.w3.org/2002/07/owl#                              160
-- http://www.w3.org/2002/07/owl                               160
-- http://www.openlinksw.com/schemas/amazon                    143
-- http://purl.org/dc/elements/1.1/                            139
-- http://www.w3.org/2007/05/powder-s#                         117
-- http://www.w3.org/2000/01/rdf-schema#                       87
-- http://www.w3.org/1999/02/22-rdf-syntax-ns#                 85
-- http://www.openlinksw.com/schemas/ebay                      79
-- virtrdf-label                                               56
-- http://www.openlinksw.com/schema/attribution#               51
-- dbprdf-label                                                6
--
-- 30 Rows. -- 215219 msec.

Depending on what you want to do with your db, you might want to increase the timeouts for SPARQL a lot.

Backing up this initial state

Now is a good moment to back up the whole db (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvf - /var/lib/virtuoso | gzip --fast > virtuoso-6.1.3-DBDUMP-dbpedia-3.6-en-$(date '+%F').tar.gz &&
/etc/init.d/virtuoso-opensource start
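
A backup is only worth something if the archive is readable. A minimal check, assuming the database file inside the archive is called virtuoso.db (which I believe is the default name; adjust the pattern if yours differs). verify_backup is a made-up helper name:

```shell
# Hypothetical helper: succeed only if the archive passes the gzip
# integrity test and lists a virtuoso.db file.
verify_backup() {
  gzip -t "$1" && tar -tzf "$1" | grep -q 'virtuoso\.db$'
}

# Example use:
#   verify_backup virtuoso-6.1.3-DBDUMP-dbpedia-3.6-en-2011-06-14.tar.gz && echo "backup looks OK"
```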

Yay, done ;)

Edit (Nov. 6th 2010): fixed escaping of html special chars (<,>,&) in code snippets.
Edit (Jun. 14th 2011): updated after importing DBpedia 3.6
Edit (Dec. 29th 2011): added a note that isql might be named isql-vt
Edit (May 25th 2012): link to version for DBpedia 3.7

53 thoughts on “Setting up a local DBpedia mirror with Virtuoso”

  1. thomas

    Jörn, thank you so much for this great help, What do you mean by “Fire up screen if you didn’t already.” ?

    1. joern Post author

      Hi Thomas,
      thanks for your reply. screen is a very useful command for server management, as it comes in very handy with ssh-connections.

      With screen you sort of start a virtual shell session which allows you to create “multiple sessions” (so multiple shells and switching between them in one connection), as well to “detach” and later on “reattach” to such sessions. This is very useful for long running tasks, as you can detach, let the task run, disconnect from the server, (task is still running), reconnect to the server and reattach to screen and find the still running task in the virtual session, as if you never disconnected.
      Sometimes people prefer twiddling with the nohup, but screen is a lot more comfortable. Also it protects you against dropping ssh-connections.
      For more information either use man screen or see this page.

      Greetings,
      Jörn

  2. claudio

    Thanks Joern,

    you made my day, I was getting crazy with old scripts (load_nt.sh) etc!

  3. Valentin Grouès

    Thanks a lot! I tried to follow your steps but I got a few errors during the import of some of the files. Three of them in fact:
    – page_links_en.nt.gz error 37000 SP029: TURTLE RDF loader , line 16661091: syntax error
    – long_abstracts_en.nt.gz error 37000 SP029: TURTLE RDF loader, line 3134668: Unterminated short double-quoted string at
    -infobox_properties_en.nt.gz error 37000 SP029: TURTLE RDF loader, line 16304120: syntax error

    I tried to have a look at the mentioned lines but I don’t see any problem for two of them. For long_abstracts, it seems that there is an unescaped quote.

    for the others:

    zcat page_links_en.nt.gz | head -n 15518997 | tail -n 1
    => <http://dbpedia.org/resource/Lloyd_Axworthy> <http://dbpedia.org/property/wikilink> <http://dbpedia.org/resource/Ottawa_Treaty> .
    zcat infobox_properties_en.nt.gz | head -n 3134668 | tail -n 1
    => <http://dbpedia.org/resource/Kevin_Rudd> <http://dbpedia.org/property/website> "PM.gov.au and KevinPM.com.au"@en .

    If someone has any idea to solve this, I would really appreciate because I am kind of stuck with this.

    1. joern Post author

      Hi Valentin,
      are you using the DBpedia 3.5.1 dumps or 3.6?
      Maybe you can post a few lines around the problematic ones?

      Jörn

      1. joern Post author

        I thought maybe the md5sums of the preprocessed files (so the filtered ones) might help you:

        $ md5sum page_links_en.nt.gz long_abstracts_en.nt.gz infobox_properties_en.nt.gz
        35d4b2579041dce6a4203cc6ddf9d8a4  page_links_en.nt.gz
        432126c46ecd17658de192433926b6aa  long_abstracts_en.nt.gz
        efef38a9ce51025af9a9bd3c3c67ff44  infobox_properties_en.nt.gz
        1. Valentin Grouès

          Thanks for the md5.
          It’s strange but I have a different md5 for each of those files. Maybe they changed the files on the dbpedia server. I am trying with the 3.6 release now. Hopefully it will solve the problems and I’ll have a more up to date version at the same time. I’ll keep you informed.

  4. adam

    Has anyone been able to load the 3.6 dump successfully? I’m 1/2 way through the triples and have encountered a few errors as well –

    15:53:41 PL LOG: File /usr/local/data/dbpedia/3.6/en/infobox_properties_en.nt.gz error 37000 SP029: TURTLE RDF loader, line 16965718: syntax error

    21:07:12 PL LOG: File /usr/local/data/dbpedia/3.6/en/long_abstracts_en.nt.gz error 37000 SP029: TURTLE RDF loader, line 3209624: Unterminated short double-quoted string at

    Wed May 18 2011

    1. joern Post author

      I will have a look into updating to 3.6 soon. As i have heard that there are problems with the files multiple times now, it would be nice to know if someone already solved the syntax errors and/or unterminated string problems.

      Jörn

    2. Valentin Grouès

      Hi, I finally managed to load the 3.6 dump. I had some unclear problems that I solved just decompressing the problematic files before importing them. Maybe you can just try this. Also, you will still have to solve the long urls problem like explained by Jörn.

      1. samur araujo

        Hi Valentin, would you have a virtuoso dump of your dbpedia available for download? It would be greatly appreciated.

        Cheers,
        Samur Araujo

  5. joern Post author

    Updated the post after importing DBpedia 3.6.
    Actually everything worked the same as with the old version, just that my first download seems to have been broken. Should you download the 3.6 all_languages.tar, here is its md5sum: 35907c55505b13b4091afbe816fcdba9 .
    Jörn

  6. Nikhil

    I tried to import 3.7 en and there is a problem importing the infobox file. I get the following error in the log.

    11:48:33 PL LOG: File /usr/local/data/dbpedia/3.7/en/infobox_properties_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column ‘DB.DBA.RDF_QUAD.P’

    11:48:39 PL LOG: File /usr/local/data/dbpedia/3.7/en/infobox_property_definitions_en.nt.gz error 23000 SR133: Can not set NULL to not nullable column ‘DB.DBA.RDF_QUAD.S’

    Any ideas how to fix it?

    1. joern Post author

      AFAIK the DBpedia 3.7 dump is not officially released yet. I read several emails telling that the version which is currently online is broken. I’d suggest to use the 3.6 until the problems in 3.7 are fixed and the release is officially announced.

      1. joern Post author

        As i found out the DBpedia 3.7 dumps use UTF-8 encoded IRIs instead of properly %-escaped URIs for all except the EN files. This isn’t valid ntriples but a turtle / n3 parser should be able to handle it. ATM I’m a bit undecided about this and we’re still discussing about this internally.

        1. mataron

          Hi!
          Thanks for the nice & helpful tutorial! It did actually save me from lots of searching.

          Do you have any suggestion how to go about importing dbpedia 3.7? Is there any option in virtuoso to set the parser?

          The error I get is similar to the one posted above.

          Thanks in advance!

  7. Olivier

    I have run the procedure on a Fedora system (use yum to search and install Virtuoso packages, you will save the compilation time for virtuoso)

    I have one problem left: procedures from the XPath NS are unknown. Typically, if I want to use the “concat” function in my SPARQL request, I get the error “Virtuoso 42001 Error SR185: Undefined procedure DB.DBA.http://www.w3.org/2005/xpath-functions/#concat.”

    Any idea ?

  8. DBPMirror

    Hi.

    I loaded a couple of DBPedia datasets into virtuoso, about 16 million triples. But queries for a specific resource take more than 5 seconds to finish. What is going to happen when I load the whole thing?
    Is that the way that the actual DBPedia servers work, I mean, running through all the huge (and growing) dbpedia table?

    Thanks and regards.

    1. joern Post author

      Usually virtuoso should have quite a complete set of indices, which means your queries should be quite fast. But as mentioned in the article this applies to a “warm” DB and you should provide enough RAM to the DB, else it might feel a bit laggy (from my experience).

  9. Daniele

    Hey, great post!
    I’m getting crazy with Dbpedia dump & virtuoso, really!
    I’m trying to mirror Dbpedia 3.7 and successfully (I hope!) installed virtuoso 6.1.3.

    After a few minutes rdf_loader_run() has started I get the following error:
    “18:58:00 PL LOG: File /Dbpedia_datasets/global.graph error 37000 SP029: TURTLE RDF loader, line 1: Undefined namespace prefix at http://dbpedia.org
    18:58:00 Server received signal 11. Continuing with the default action for that signal.
    Segmentation fault”

    where global.graph is a file I created according to http://ods.openlinksw.com/wiki/main/Main/VirtBulkRDFLoaderExampleDbpedia and that contains only “http://dbpedia.org” and saved in the folder containing Dbpedia datasets.

    Do you have any idea or suggestion?

    Thanks in advance!
    Daniele

  10. Roberto

    Any luck someone with DBPedia 3.7.?
    I’m getting “23000 SR133: Can not set NULL to not nullable column…” errors with infobox_properties_en.nt.gz and infobox_property_definitions_en.nt.gz
    despite I fixed URI-length using the code included in this post.

  11. Sharath Avadhanam

    Hi Joern,

    I am working on porting local dbpedia dumps to localhost via virtuoso. Your post was very helpful. However, isql throws an error at “isql # enter virtuoso sql mode” on your page.
    It throws the following error
    “[IM002][unixODBC][Driver Manager]Data source name not found, and no default driver specified” when I run “isql -v 1111 dba dba “(isql doesn’t run without passing args). It probably has to do with the odbc.ini file. Can you share the settings in the odbc.ini file (after masking user and password fields)? Or am I doing something wrong?

    Thanks,
    Sharath

  12. Sharath Avadhanam

    Update: isql should be isql-vt on ubuntu. Figured that just now. Kindly ignore my older post. Nothing to do with odbc.ini. Also the captcha on the blog is very hard to read even with multiple reloads. Just wanted to let you know

    1. joern Post author

      Thanks for the comment, but did you build virtuoso with the corresponding arguments as in the post? (If anyone else could confirm that isql was renamed to isql-vt i’d edit the post).
      Wrt. the captcha: sorry it’s just what re-captcha provides :-/

          1. joern Post author

            hmm, i get the feeling that you didn’t install virtuoso from source but via the ubuntu package. Might be that it’s different there. Still i’ll update the post to make people aware that isql might be called isql-vt.
            thanks

  13. June

    Will rdf_loader_run() be more and more slower?
    I load dbpedia by 4 parts, and when loading the last part, it is getting extraordinarily slow…

    1. joern Post author

      uhm, i never noticed this, as i always imported the dump in one shot.

      I noticed that things get terribly slow when you run out of RAM, so maybe that happens when you import the last part?
      Also I’d do commits and checkpoints between the parts, just to make sure there’s not too much pending work.

  14. florian

    I really enjoyed following the tutorial!!!
    I’m not a linux-expert but I found it easy to set up and it ran through without any errors. Because I don’t have a fast enough server I set up a virtual machine with VirtualBox. The virtual machine was running on 800MB RAM and a pretty old DualCore. It took 3 days to import the German version of dbpedia but queries on the SPARQL endpoint work. You might experience errors when querying the endpoint because the estimated execution time exceeds a predefined limit. I fixed this by logging into http://your-server:8890 and increasing the limits in the SPARQL section of the admin interface. Now it’s time to play ;)

    Greetings,
    florian

  15. Peter

    Hi Joern,

    thanks for the great tutorial! It saved my life! I don’t think I’d have managed to get the server and the imports running without your valuable instructions.
    Just one quick question, maybe someone can help with regard to this: I’ve been running the import for 2 (!) days now, and have imported roughly half of the files. Although I’m running on a pretty slow machine (dual core, 3 Gig Memory) I’m surprised that neither CPU nor memory are significantly in use. It’s almost exclusively the HDD which is working all the time, which is probably why the import is so incredibly slow. Any ideas where to tune things in the configuration?
    My virtuoso.ini config is currently set to the following, but it doesn’t seem to have any effect:
    NumberOfBuffers = 224000
    MaxDirtyBuffers = 200000

    Thanks a lot!
    Nico

    1. joern Post author

      Hi Nico,
      with those settings virtuoso should use at least half of your ram and the rest should probably be OS buffers caching access to the hd… I’d just fire up top or htop and have a look on the virtuoso memory column. If it doesn’t use a significant amount of memory something is wrong. Are you sure you (re)started the virtuoso server after changing the buffer values? Did you set the MaxCheckpointRemap according to the post?
      Cheers,
      Jörn

  16. Nico

    Hi Joern,
    thanks for your reply.
    Something seems to be going terribly wrong. Obviously, virtuoso is only using a small piece of the available memory, thus leaving everything else to the HDD. Checked with top, which is telling me the same (~250MB in use for virtuoso).
    It seems to me like virtuoso is ignoring any settings made in the virtuoso.ini. Looking into the output of the server status, I see the following:

    Database Status:
      File size 17060331520, 2082560 pages, 789300 free.
      2000 buffers, 1996 used, 0 dirty 0 wired down, repl age 573 0 w. io 0 w/crsr.
      Disk Usage: 8764 reads avg 0 msec, 0% r 0% w last  0 s, 3073 writes,
        29 read ahead, batch = 26.  Autocompact 0 in 0 out, 0% saved.

    So the number of buffers is defaulted to 2000 instead of the 230k I specified. Any idea what the reason for this could be? (ini file is in /var/lib/virtuoso/db)

    1. joern Post author

      maybe there’s another install of virtuoso interfering with the one you actually want to use? (e.g., one installed by your distribution?)
      also see the comment about isql and isql-vt.

    2. sherifkandeel

      I installed it on windows and set the values of the virtuoso.ini file manually from the text editor, also when I write status(); I do find the defaults values also , not the ones I mentioned in the .ini file :/

      1. Nico

        Finally fixed it for me. As simple as it can be, the reason was that I had some leading spaces in front of the lines with the buffer settings. Apparently, virtuoso ignores any parameters in the ini file that do not start at the very first character of a new line.
        Maybe this helps in your case as well.
        Nico

  17. sherifkandeel

    Impressive :)
    your tutorial is actually very good, But I am having a problem here:
    The code you showed to correct the three faulty .gz files, I run the code and instantly it says “done.” in no time, so I think it didn’t do anything (especially with the page_links_en.nt.gz)
    I am running the server on VM with 4GB RAM and an i7 Quad core processor…
    Is there something I might be doing wrong?

    Is there a script to do this in windows instead?

  18. Chryssa

    Hi !! Your tutorial was great! I wish I had found it earlier!!
    I have loaded several dbpedia dumps but I have actually used the extracted versions.
    I am getting 23000 SR133: Cannot set NULL to not nullable column ‘DB.DBA.RDF_QUAD.O’ errors in several files due to URIs much longer than 1024 characters.
    Could you indicate how to fix them for files that are not gzipped? (the code you used works only with gzipped files….)

    Thanks a lot!!

    1. joern Post author

      Well, just use cat instead of zcat or even better:

      grep -v -E '^<.+> <.+> <.{1025,}> .$' < "$i" | gzip --fast > "${i%.nt}_cleaned.nt.gz"

      Gives you gzipped files, as they’re better, believe it or not!

  19. joern Post author

    well, just use cat instead of zcat or even better:

    grep -v -E '^<.+> <.+> <.{1025,}> .$' < "$i" | gzip --fast > "${i%.nt}_cleaned.nt.gz"

    Gives you gzipped files, as they’re better, believe it!

    1. joern Post author

      Thanks for this feedback, nice to hear that there’s an option in the new versions. I’ll update the article soon as I’m currently importing the EN, DE DBpedia 3.7 .

  20. Dan

    Jörn, thank you for the great tutorial! I successfully loaded the 3.6 dump version and now I’m fighting with the 3.7 version. I’m struggling because performance went dramatically down after loading 15GB out of 30GB in total.

    I’m using a 4Gb RAM machine on mac os and set virtuoso.ini:
    NumberOfBuffers = 300000
    MaxDirtyBuffers = 225000
    MaxCheckpointRemap = 75000

    The infobox .nt itself is 7Gb+! I also launched 3 parallel rdf_loader_run (); opening 3 different isql windows. If I check resources usage using “top” command, now virtuoso-t is using less than 10% of CPU whilst it was about 100% at the beginning.

    I’m not sure if everything is something close to to a stuck situation or whether I should leave it run anyway. Any tips on how to improve performance?

    Thanks!

  21. Pingback: Loading dbPedia data into a local Virtuoso installation « TaaS – Thoughts as a Service

  22. Pingback: Setting up a local DBpedia 3.7 mirror with Virtuoso 6.1.5+ | Jörn's Blog

  23. Venkatesh

    Can you please mention the link to the thread again as this thread is not available currently?

  24. Pingback: How to install OpenLink Virtuoso on a Mac | Tanya Gray Jones

  25. Pingback: Setting up a local DBpedia 3.9 mirror with Virtuoso 7 | Jörn's Blog
