# Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, …) with Virtuoso 7.2.1 and Docker (optional)

So you’re the one who gets to set up a local DBpedia mirror, or more generally a local Linked Data mirror, for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trial and error. If anything goes wrong (or everything works fine), feel free to leave a comment below.

## Versions of this guide

There are four older versions of this guide:

• Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1
• May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)
• Apr. 2014: Update to DBpedia 3.9 and Virtuoso 7
• Nov. 2014: Update to DBpedia 2014 and other Datasets and Virtuoso 7.1.0

In this step by step guide I’ll tell you how to install a local Linked Data mirror of DBpedia 2015-04, hosting a combination of the regular English and (as an example) the i18n German datasets, adding up to nearly 850 M triples.

I’ll also mention how you can add further datasets / vocabularies such as Freebase, Wikidata, LinkedGeoData, DBLP, Yago, Umbel and Schema.org, adding up to nearly 6 G triples.

As DBpedia is quite modular and has many internationalized (i18n) versions, it has its own section in this guide. The other datasets don’t, as they need at most minor repacking and a single line to load, as explained below.

## Used Versions

• DBpedia 2015-04
• Virtuoso OpenSource 7.2.1
• Ubuntu 14.04 LTS or Debian 8

## Prerequisites

A strong machine with root access and enough RAM: We used a VM with 4 cores and 32 GB of RAM for DBpedia only. If you intend to also load Freebase and other datasets, I recommend at least 64 GB of RAM (we actually ended up using a 16 core, 256 GB RAM server in our research group). For installing, I recommend more than 128 GB of free disk space for DBpedia alone and 512 GB if you want to load Freebase as well. The space is needed for downloading and repacking the datasets, as well as for the database file, which grows during the import (mine grew to 64 GB for DBpedia and 320 GB with all the datasets mentioned above).
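
The disk numbers above can be checked from the shell before downloading anything. This is only a minimal sketch: the helper name require_free_gb is made up for illustration, it relies on nothing but POSIX df and awk, and the 128 GB threshold is simply the DBpedia-only recommendation from this section.

```shell
# Hypothetical helper (not part of Virtuoso or the DBpedia tooling):
# succeeds if the file system holding directory $1 has at least $2 GB free.
require_free_gb() {
    dir=$1
    need_gb=$2
    # df -P prints one header line plus one data line; column 4 is the free space in KB
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    free_gb=$((free_kb / 1024 / 1024))
    if [ "$free_gb" -ge "$need_gb" ]; then
        echo "ok: $free_gb GB free in $dir"
    else
        echo "too small: only $free_gb GB free in $dir, want $need_gb GB" >&2
        return 1
    fi
}

# example: 128 GB for DBpedia alone (use 512 if you also want Freebase)
require_free_gb "$HOME" 128 || echo "better find a bigger disk first"
```

Point it at the directory where the dumps and the database file will actually live if that is not your home directory.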

This guide applies to a clean install. Please check that there’s no older version of Virtuoso installed with dpkg -l | grep virtuoso ; which isql ; which isql-vt (no output is good). If there is, please know what you’re doing. Virtuoso 6 and 7 use different default locations for their DBs, but in general newer versions should be able to upgrade older DB files if correctly configured to use the same DB file. In general I’d suggest either uninstalling the older version and its config files and then installing the new one according to this guide, or isolating the newer one with the docker approach mentioned below.

## For the impatient and docker affine

As an alternative to the following sections, which explain how to build everything from source yourself and go into detail about the DBpedia dump files, I also provide a docker image (source) that you can use to automate and simplify the process a lot:

dump_dir=~/dumps/dbpedia/2015-04
db_dir=~/virtuoso_db
mkdir -p "$dump_dir"
cd "$dump_dir"

wget -r -nc -nH --cut-dirs=1 -np -l1 \
-A '*.nt.bz2' -A '*.owl' -R '*unredirected*' \

# repacking
apt-get install pigz pbzip2
for i in */*.nt.bz2 ; do echo $i ; pbzip2 -dc "$i" | pigz - > "${i%bz2}gz" && rm "$i"; done
mkdir classes
cd classes
cd

# install some VAD packages for DBpedia into our db which we'll keep in db_dir
docker run -d --name dbpedia-vadinst \
-v "$db_dir":/var/lib/virtuoso-opensource-7 \
joernhees/virtuoso run &&
docker exec dbpedia-vadinst wait_ready &&
docker exec dbpedia-vadinst isql-vt PROMPT=OFF VERBOSE=OFF BANNER=OFF \
"EXEC=vad_install('/usr/share/virtuoso-opensource-7/vad/rdf_mappers_dav.vad');" &&
docker exec dbpedia-vadinst isql-vt PROMPT=OFF VERBOSE=OFF BANNER=OFF \
"EXEC=vad_install('/usr/share/virtuoso-opensource-7/vad/dbpedia_dav.vad');" &&
docker stop dbpedia-vadinst &&
docker rm -v dbpedia-vadinst &&
# starting the import
docker run --rm \
-v "$db_dir":/var/lib/virtuoso-opensource-7 \
-v "$dump_dir"/classes:/import:ro \
joernhees/virtuoso import 'http://dbpedia.org/resource/classes#' &&
# docker import of the actual data (will use 64 GB RAM and take about 1 hour)
docker run --rm \
-v "$db_dir":/var/lib/virtuoso-opensource-7 \
-v "$dump_dir"/core:/import:ro \
-e "NumberOfBuffers=$((64*85000))" \
joernhees/virtuoso import 'http://dbpedia.org' &&

# running the local endpoint on port 8891 with 32 GB RAM:
docker run --name dbpedia \
-v "$db_dir":/var/lib/virtuoso-opensource-7 \
-p 8891:8890 \
-e "NumberOfBuffers=$((32*85000))" \
joernhees/virtuoso run

# access one of the following for example:
# http://localhost:8891/sparql
# http://localhost:8891/resource/Bonn
# http://localhost:8891/conductor (user: dba, pw: dba)
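
After the last container is started, Virtuoso may need a little while before the endpoint answers. The following is a minimal sketch for polling it from the host, assuming curl is installed. The helper name wait_for_sparql and the ASK query are chosen for illustration only and are not something the image ships (inside the container, the wait_ready command used above does that job).

```shell
# Hypothetical helper: poll a SPARQL endpoint until it answers or we give up.
# $1: endpoint URL, $2: maximum number of attempts (roughly one per second)
wait_for_sparql() {
    url=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        # -G sends the url-encoded query as a GET parameter,
        # -f makes curl fail on HTTP errors, -s keeps it quiet
        if curl -sf -G --max-time 5 \
                --data-urlencode 'query=ASK { ?s ?p ?o }' \
                "$url" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "no answer from $url" >&2
    return 1
}

# usage, for the container started above:
# wait_for_sparql http://localhost:8891/sparql 60 && echo "endpoint is up"
```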


## The manual version

We’ll download Virtuoso OpenSource: either from SourceForge or GitHub (make sure you get v7.2.1 as in this guide or a newer version).

Unlike in earlier versions of this guide we’ll now first build the .deb packages and then install them with apt-get.

As building will install a lot of extra packages that you only need for building, I prepared another docker image (source) that does the whole build inside a container for you and puts the resulting .deb packages (and DBpedia VAD) into your ~/virtuoso_deb folder:

docker run --rm -it -v ~/virtuoso_deb:/export/ joernhees/dpkg_build \
-j5
# this should run for about 15 minutes
# compilation by default sadly does not create the dbpedia VAD package, so
# to do that, the above command stops after compilation in interactive mode.
# in there just execute this:
cd /tmp/build/virtuoso*/ &&
cd binsrc &&
make &&
exit


If you used this, you can skip the following down to installing the .deb packages.

If not, build manually: download the source archive into ~/virtuoso_deb on the server, then extract it and switch to the directory:

mkdir ~/virtuoso_deb
cd ~/virtuoso_deb
tar -xvzf virtuoso-7.2.1.tar.gz
cd virtuoso-opensource-7.2.1  # or newer, depending what you got


Afterwards you can use the following to install the build dependencies and actually build the .deb packages:

# install build tools
sudo apt-get install -y build-essential devscripts
# to install Virtuoso build dependencies
mk-build-deps -irt'apt-get --no-install-recommends -yV' && dpkg-checkbuilddeps
# to build Virtuoso with 5 processes in parallel
# choose something like your server's #CPUs + 1
dpkg-buildpackage -us -uc -j5


This will take about 15 min.
Afterwards if everything worked out, you should have the *.deb files in ~/virtuoso_deb.
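
The "#CPUs + 1" rule of thumb from the build comments can be computed instead of guessed. A small sketch, assuming nproc (GNU coreutils) or getconf is available; the function name build_jobs is hypothetical:

```shell
# Hypothetical helper: number of parallel build jobs = online CPUs + 1
build_jobs() {
    cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    echo $((cpus + 1))
}

# prints the job option for the build, e.g. -j5 on a 4 core machine
printf '%s\n' "-j$(build_jobs)"
```

The printed value can then be used in place of the hard-coded job count in the build command.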

We continue to also build the DBpedia VAD:

./configure --with-layout=debian --enable-dbpedia-vad && \
cd binsrc && make \


Finally, let’s create a small local repository out of the .deb files you just built. The advantage of this is that you can simply install virtuoso-server with its dependencies with apt. In theory you could also resolve them manually and install everything with dpkg -i ..., but where’s the fun in that?

cd ~/virtuoso_deb
dpkg-scanpackages ./ | gzip > Packages.gz


### Installing Virtuoso

No matter if you used the docker or manual building approach for the .deb packages of Virtuoso, you should now be able to install them with apt-get install ... after telling it where to look for the files for example by doing this:

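# sudo does not apply to a >> redirection, so append via tee; apt also needs an absolute path
echo "deb file:$HOME/virtuoso_deb ./" | sudo tee -a /etc/apt/sources.list.d/virtuoso_local_packages.list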
sudo apt-get update


After this just install Virtuoso with the following command (it should warn you about untrusted sources of the Virtuoso packages, which is because we just built them ourselves):

sudo apt-get install virtuoso-server \


Installing the VAD packages here will actually not install them in the Virtuoso DB file, but just move them in the right place so they can for example be installed as mentioned later.

To also move the DBpedia VAD in place for later you can just run this:

sudo cp ~/virtuoso_deb/dbpedia_dav.vad /usr/share/virtuoso-opensource-7/vad/


### Configuring Virtuoso

Now change the following values in /etc/virtuoso-opensource-7/virtuoso.ini. The performance tuning settings follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

# note: Virtuoso ignores lines starting with whitespace and stuff after a ;
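[Parameters]
# you need to add the directory that holds your dataset dumps to DirsAllowed,
# in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso-opensource-7/vad, /usr/local/data/datasets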
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
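
The buffer settings above scale with the amount of RAM. As a sketch: the docker commands earlier use 85000 buffers per GB of RAM, which gives the 2720000 shown here for 32 GB, and 62500 dirty buffers per GB reproduces the 2000000. MaxCheckpointRemap is taken as a quarter of NumberOfBuffers, as the comment says, so for 32 GB this prints 680000 rather than the 625000 shown above. The function name virtuoso_buffers is made up, and the per-GB factors are assumptions read off this guide's own numbers, not official constants.

```shell
# Hypothetical helper: print virtuoso.ini buffer settings for a RAM size in GB.
virtuoso_buffers() {
    ram_gb=$1
    buffers=$((ram_gb * 85000))   # 32 GB -> 2720000, as in the config above
    dirty=$((ram_gb * 62500))     # 32 GB -> 2000000, as in the config above
    remap=$((buffers / 4))        # "1/4th of NumberOfBuffers"
    echo "NumberOfBuffers = $buffers"
    echo "MaxDirtyBuffers = $dirty"
    echo "MaxCheckpointRemap = $remap"
}

virtuoso_buffers 32
```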


Afterwards restart Virtuoso:

sudo /etc/init.d/virtuoso-opensource-7 stop
sudo /etc/init.d/virtuoso-opensource-7 start


You should now have a running Virtuoso server.

### DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 2015-04 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., core/labels-en-uris_de.nt.bz2). This is problematic, as for example articles which are only in the German Wikipedia won’t be extracted. To solve this problem the i18n versions exist and create IRIs in the form of de.dbpedia.org for every article in the German Wikipedia (e.g., core-i18n/de/labels_de.nt.bz2).

This approach has several implications. For backwards compatibility reasons the standard DBpedia makes statements about URIs such as http://dbpedia.org/resource/Gerhard_Schr%C3%B6der while the local chapters, like the German one, make statements about IRIs such as http://de.dbpedia.org/resource/Gerhard_Schröder (note the ö). In other words and as written above: the standard DBpedia uses URIs to identify things, while the localized versions use IRIs. This also means that http://dbpedia.org/resource/Gerhard_Schröder shouldn’t work. That said, clicking the link will actually work, as there is magic going on in your browser to give you what you probably meant. Using curl (curl -i -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Gerhard_Schröder) or SPARQLing the endpoint will nevertheless not be so nice/sloppy and can cause quite some headache. Observe how the following two SPARQL queries return different results: select * where { dbpedia:Gerhard_Schröder ?p ?o. } vs. select * where { <http://dbpedia.org/resource/Gerhard_Schr%C3%B6der> ?p ?o. }. In order to mitigate this historic problem a bit, DBpedia actually offers owl:sameAs links from IRIs to URIs in core/iri-same-as-uri_en.nt.bz2, which you should load, so you at least have a link to what you want if someone tries to get info about an IRI.
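
To make the difference between the two forms concrete, here is a minimal sketch that turns an IRI into the percent-encoded URI form by escaping every non-ASCII byte. It assumes UTF-8 input and only POSIX tools (od, tr, printf). The function name iri_to_uri is hypothetical, and DBpedia's own encoding rules may differ in detail for some ASCII characters, which this sketch leaves untouched.

```shell
# Hypothetical helper: map an IRI to a URI by percent-encoding every non-ASCII
# byte. Assumes the input is UTF-8 encoded; ASCII characters are left as they are.
iri_to_uri() {
    # od prints each byte as an unsigned decimal number, tr puts one number per line
    printf '%s' "$1" | od -An -v -tu1 | tr -s ' ' '\n' | while read -r byte; do
        [ -n "$byte" ] || continue
        if [ "$byte" -lt 128 ]; then
            # ASCII: print the byte itself via its octal escape
            printf '%b' "\\0$(printf '%o' "$byte")"
        else
            # non-ASCII: print it as %XX
            printf '%%%02X' "$byte"
        fi
    done
    echo
}

iri_to_uri 'http://dbpedia.org/resource/Gerhard_Schröder'
```

Going through the bytes one by one is slow, so this is meant for single identifiers, not for rewriting whole dumps.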

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels ({core,core-i18n/de}/labels-en-uris_de.nt.bz2) and there are the localized triple files which associate for example the DE IRIs with the German labels (core-i18n/de/labels_de.nt.bz2).

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at the core directory, which contains all datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German DBpedia loaded in separate graphs, as well as each of their pagelink datasets in yet another separate graph each. For this we download the corresponding files in (NT) format as follows. If you need something different do so (and maybe report back if there were problems and how you solved them).

# see comment above, you could also get another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/2015-04
cd /usr/local/data/datasets/dbpedia/2015-04
wget -r -nc -nH --cut-dirs=1 -np -l1 -A '*.nt.bz2' -A '*.owl' -R '*unredirected*' http://downloads.dbpedia.org/2015-04/{core/,core-i18n/en,core-i18n/de,dbpedia_2015-04.owl}


As already mentioned, DBpedia 2015-04 introduced a core folder which contains all files loaded on the public DBpedia endpoint. Be aware that if you download other folders like above, you’ll be downloading some files twice (e.g., labels-en-uris_de.nt.bz2 can be found in both the core folder and the core-i18n/de folder). Especially the core-i18n/en folder contains very many duplicates of files from core. If you want to see which downloaded files are duplicates (independent of their name), and especially which core-i18n/en files were not loaded on the public endpoint and so are not in core, you can do the following:

# compute md5 hashes for all downloaded files
find . -mindepth 2 -type f -print0 | xargs -0 md5sum > md5sums

# first check if there are duplicates in other folders without core
LC_ALL=C sort md5sums | grep -v '/core/' | uniq -w32 -D
ba3fc042b14cb41e6c4282a6f7c45e02  ./core-i18n/en/instance-types-dbtax-dbo_en.nt.bz2
ba3fc042b14cb41e6c4282a6f7c45e02  ./core-i18n/en/instance_types_dbtax-dbo.nt.bz2


So it seems the ./core-i18n/en/instance-types-dbtax-dbo_en.nt.bz2 and ./core-i18n/en/instance_types_dbtax-dbo.nt.bz2 files are actually the same.

To list all the files in core-i18n/en which are duplicates do this:

# list all dup files in core-i18n/en
LC_ALL=C sort md5sums | uniq -w32 -D | grep '/core-i18n/en'
068975f6dd60f29d13c8442b0dbe403d  ./core-i18n/en/skos-categories_en.nt.bz2
14a770f293524a5713f741a1a448bcfa  ./core-i18n/en/short-abstracts_en.nt.bz2
1958649209bc90944c65eccd30d37c6c  ./core-i18n/en/infobox-property-definitions_en.nt.bz2
3b42f351fc30f6b6b97d3f2a16ef6db3  ./core-i18n/en/instance-types-transitive_en.nt.bz2
3b61b11bdcb50a0d44ca8f4bd68f4762  ./core-i18n/en/revision-ids_en.nt.bz2
43a8b17859c50d37f4cab83573c2992e  ./core-i18n/en/instance_types_sdtyped-dbo_en.nt.bz2
4c847b2754294c555236d09485200435  ./core-i18n/en/instance-types_en.nt.bz2
63e2cde88e7bdefb6739c62aa234fc1e  ./core-i18n/en/category-labels_en.nt.bz2
75f2d135459c824feee1d427e4165a4f  ./core-i18n/en/transitive-redirects_en.nt.bz2
82fe80c3868a89d54fec26c919a4fa50  ./core-i18n/en/revision-uris_en.nt.bz2
8407c84d262b573418326bdd8f591b95  ./core-i18n/en/mappingbased-properties_en.nt.bz2
9152e34db96df2dd4991e78b7e53ff3f  ./core-i18n/en/article-categories_en.nt.bz2
94b48e9df78f746e60a9d0c1aafa3241  ./core-i18n/en/infobox-properties_en.nt.bz2
a254ce4596d045cc047959831edd318a  ./core-i18n/en/disambiguations_en.nt.bz2
ab29899e43fab1c6f060cdb8955c5b19  ./core-i18n/en/images_en.nt.bz2
ae046e03be0cf29eac1e3b8a8b3d6b03  ./core-i18n/en/persondata_en.nt.bz2
b4710d36b8dc915f07f5cec2d9971a27  ./core-i18n/en/page-ids_en.nt.bz2
ba3fc042b14cb41e6c4282a6f7c45e02  ./core-i18n/en/instance-types-dbtax-dbo_en.nt.bz2
ba3fc042b14cb41e6c4282a6f7c45e02  ./core-i18n/en/instance_types_dbtax-dbo.nt.bz2
bd90ce4064a120794b5eb5a8d024a97d  ./core-i18n/en/long-abstracts_en.nt.bz2
e4c422d1d23c69eff3b9d7d7df3f2f80  ./core-i18n/en/homepages_en.nt.bz2
eafc557cde69fd1cd8f78565c385ee16  ./core-i18n/en/iri-same-as-uri_en.nt.bz2
ef48deae48c9c9c5e17585e3f0243663  ./core-i18n/en/labels_en.nt.bz2
fa8800165c7e80509a4ebddc5f0caf90  ./core-i18n/en/geo-coordinates_en.nt.bz2

# to delete the duplicates from /core-i18n/en, leaving just one of each:
LC_ALL=C sort md5sums | uniq -w32 -D | grep '/core-i18n/en' | uniq -w32 | cut -d' ' -f3 | xargs rm

# afterwards these should be left:
ls -1 core-i18n/en
core-i18n/en/anchor-text_en.nt.bz2
core-i18n/en/article-templates_en.nt.bz2
core-i18n/en/genders_en.nt.bz2
core-i18n/en/instance_types_dbtax-dbo.nt.bz2
core-i18n/en/instance_types_dbtax_ext.nt.bz2
core-i18n/en/instance_types_lhd_dbo_en.nt.bz2
core-i18n/en/instance_types_lhd_ext_en.nt.bz2
core-i18n/en/out-degree_en.nt.bz2
core-i18n/en/page-length_en.nt.bz2
core-i18n/en/pnd_en.nt.bz2
core-i18n/en/redirects_en.nt.bz2
core-i18n/en/topical-concepts_en.nt.bz2
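
Note that uniq -w32 -D is a GNU extension. Where it is missing (e.g. on BSD or macOS), the same grouping by hash can be sketched with awk. The function name dup_files is hypothetical; it expects a file in the md5sums format produced above, and the hashes in the demo are made up.

```shell
# Hypothetical helper: print all lines of an md5sum-style file ("<hash>  <path>")
# whose hash occurs more than once, sorted so that equal hashes are adjacent.
dup_files() {
    awk '
        { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
        END { for (h in count) if (count[h] > 1) printf "%s", lines[h] }
    ' "$1" | LC_ALL=C sort
}

# small self-contained demo with made-up hashes
demo=$(mktemp)
printf '%s\n' \
    'aaaa  ./core/labels_en.nt.bz2' \
    'bbbb  ./core-i18n/en/genders_en.nt.bz2' \
    'aaaa  ./core-i18n/en/labels_en.nt.bz2' > "$demo"
dup_files "$demo"
rm -f "$demo"
```

On the real data, dup_files md5sums | grep '/core-i18n/en' corresponds to the listing command above.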


As Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, you can either repack them into gzip format or extract them. On our server the import was noticeably slower from extracted files than from gzipped ones (not to mention the vast amount of disk space wasted on the extracted files). File access becomes a bottleneck if you have a couple of cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see, I do the repacking with the parallel versions of bzip2 and gzip (pbzip2 and pigz). If that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repacking process below took about 30 minutes but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

# if you want to save space do this:
apt-get install pigz pbzip2
for i in core/*.nt.bz2 core-i18n/*/*.nt.bz2 ; do echo $i ; pbzip2 -dc "$i" | pigz - > "${i%bz2}gz" && rm "$i" ; done

# else just extract the files:
#pbzip2 -d core/*.bz2 core-i18n/*/*.bz2

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 30 minutes)
# gzipped data is reasonably compact, but still very fast to access (in contrast to bz2), so maybe this is the best choice.


### Data Cleaning and The bulk loader scripts

In contrast to the previous versions of this article, the Virtuoso import will take care of shortening too long IRIs itself. Also it seems the bulk loader script is included in the more recent Virtuoso versions, so as a reference only: see the old version for the cleaning script and VirtBulkRDFLoaderExampleDbpedia.

### Importing DBpedia dumps into Virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the DBpedia dir (recursively, with ld_dir_all) to be added to the DBpedia graph. If you use this method, make sure that only files you really want to import reside in the given subtree.
Also don’t forget to import the dbpedia_2015-04.owl file!
If you only want one directory’s files to be added (non recursive) use ld_dir('dir', '*.*', 'graph');.
If you manually want to add some files, use ld_add('file', 'graph');.
See the VirtBulkRDFLoaderScript file for details.

Be warned that it might be a bad idea to import the normal and i18n dataset into the same graph if you didn’t select specific languages, as it might introduce a lot of duplicates that are hard to disentangle.

In order to keep track (and easily reproduce) what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/2015-04/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/2015-04/.

Also be aware that if you load certain parts of the dumps into different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets), only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

So if you only want to load the same datasets as loaded on the official endpoint then importing the core folder (first section below) and dbpedia_2015-04.owl file should be enough.

cd /usr/local/data/datasets/dbpedia/2015-04/
mkdir importedGraphs
cd importedGraphs

mkdir dbpedia.org
cd dbpedia.org
# ln -s ../../dbpedia*.owl ./  # see below!
ln -s ../../core/*.nt.gz ./
cd ..

mkdir ext.dbpedia.org
cd ext.dbpedia.org
ln -s ../../core-i18n/en/anchor-text_en.nt.gz ./
ln -s ../../core-i18n/en/article-templates_en.nt.gz ./
ln -s ../../core-i18n/en/genders_en.nt.gz ./
ln -s ../../core-i18n/en/instance_types_dbtax-dbo.nt.gz ./
ln -s ../../core-i18n/en/instance_types_dbtax_ext.nt.gz ./
ln -s ../../core-i18n/en/instance_types_lhd_dbo_en.nt.gz ./
ln -s ../../core-i18n/en/instance_types_lhd_ext_en.nt.gz ./
ln -s ../../core-i18n/en/out-degree_en.nt.gz ./
ln -s ../../core-i18n/en/page-length_en.nt.gz ./
cd ..

mkdir topicalconcepts.dbpedia.org
cd topicalconcepts.dbpedia.org
ln -s ../../core-i18n/en/topical-concepts_en.nt.gz ./
cd ..

mkdir de.dbpedia.org
cd de.dbpedia.org
ln -s ../../core-i18n/de/article-categories_de.nt.gz ./
ln -s ../../core-i18n/de/article-templates_de.nt.gz ./
ln -s ../../core-i18n/de/category-labels_de.nt.gz ./
ln -s ../../core-i18n/de/disambiguations_de.nt.gz ./
ln -s ../../core-i18n/de/geo-coordinates_de.nt.gz ./
ln -s ../../core-i18n/de/homepages_de.nt.gz ./
ln -s ../../core-i18n/de/images_de.nt.gz ./
ln -s ../../core-i18n/de/infobox-properties_de.nt.gz ./
ln -s ../../core-i18n/de/infobox-property-definitions_de.nt.gz ./
ln -s ../../core-i18n/de/instance-types_de.nt.gz ./
ln -s ../../core-i18n/de/instance_types_lhd_dbo_de.nt.gz ./
ln -s ../../core-i18n/de/instance_types_lhd_ext_de.nt.gz ./
ln -s ../../core-i18n/de/instance-types-transitive_de.nt.gz ./
ln -s ../../core-i18n/de/iri-same-as-uri_de.nt.gz ./
ln -s ../../core-i18n/de/labels_de.nt.gz ./
ln -s ../../core-i18n/de/long-abstracts_de.nt.gz ./
ln -s ../../core-i18n/de/mappingbased-properties_de.nt.gz ./
ln -s ../../core-i18n/de/out-degree_de.nt.gz ./
ln -s ../../core-i18n/de/page-ids_de.nt.gz ./
ln -s ../../core-i18n/de/page-length_de.nt.gz ./
ln -s ../../core-i18n/de/persondata_de.nt.gz ./
ln -s ../../core-i18n/de/pnd_de.nt.gz ./
ln -s ../../core-i18n/de/revision-ids_de.nt.gz ./
ln -s ../../core-i18n/de/revision-uris_de.nt.gz ./
ln -s ../../core-i18n/de/short-abstracts_de.nt.gz ./
ln -s ../../core-i18n/de/skos-categories_de.nt.gz ./
ln -s ../../core-i18n/de/specific-mappingbased-properties_de.nt.gz ./
ln -s ../../core-i18n/de/transitive-redirects_de.nt.gz ./
cd ..
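
The repetitive mkdir / cd / ln -s pattern above can also be wrapped in a small function. This is just a sketch of the same idea: the name link_into_graph is hypothetical, and unlike the relative links above it creates absolute links, so it can be called from the 2015-04 directory without changing into each graph directory.

```shell
# Hypothetical helper: link the given dump files into a per-graph directory.
# $1: graph directory (e.g. importedGraphs/ext.dbpedia.org), rest: files to link
link_into_graph() {
    graph_dir=$1
    shift
    mkdir -p "$graph_dir"
    for f in "$@"; do
        if [ -e "$f" ]; then
            # link via the absolute path so the link resolves from anywhere
            ln -sf "$(cd "$(dirname "$f")" && pwd)/$(basename "$f")" "$graph_dir/"
        else
            echo "skipping missing file: $f" >&2
        fi
    done
}

# usage, run from /usr/local/data/datasets/dbpedia/2015-04:
# link_into_graph importedGraphs/dbpedia.org core/*.nt.gz
# link_into_graph importedGraphs/topicalconcepts.dbpedia.org core-i18n/en/topical-concepts_en.nt.gz
```

The warning for missing files helps to spot typos in file names before anything is registered for import.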


This should have prepared your importedGraphs directory. From this directory you can run the following command which prints out the necessary isql-vt commands to register your graphs for importing:

for g in * ; do echo "ld_dir_all('$(pwd)/$g', '*.*', 'http://$g');" ; done

One more thing (thanks to Romain): In order for the DBpedia VAD package (which is installed at the end) to work correctly, the dbpedia_2015-04.owl file needs to be imported into the graph http://dbpedia.org/resource/classes#.

Note: In the following I will assume that your Virtuoso isql command is called isql-vt. If you lack such a command, it might be called isql or isql-v, but this usually means you installed Virtuoso using some other method than the one described here.

isql-vt # enter Virtuoso isql mode

-- we are in sql mode now
ld_add('/usr/local/data/datasets/remote/dbpedia/2015-04/dbpedia_2015-04.owl', 'http://dbpedia.org/resource/classes#');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org', '*.*', 'http://ext.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/pagelinks.de.dbpedia.org', '*.*', 'http://pagelinks.de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/topicalconcepts.dbpedia.org', '*.*', 'http://topicalconcepts.dbpedia.org');
-- do the following to see which files were registered to be added:
select * from DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;

You can now also register other datasets like Freebase, DBLP, Yago, Umbel and Schema.org … that you want to be loaded after downloading them to the appropriate directories like this:

ld_add('/usr/local/data/datasets/remote/schema.org/2015-11-04/all.nt', 'http://schema.org');
ld_dir_all('/usr/local/data/datasets/remote/umbel/External Ontologies', '*.n3', 'http://umbel.org/umbel/rc');
ld_add('/usr/local/data/datasets/remote/umbel/Ontology/umbel.n3', 'http://umbel.org/umbel');
ld_add('/usr/local/data/datasets/remote/umbel/Reference Structure/umbel_reference_concepts.n3', 'http://umbel.org/umbel/rc');
ld_add('/usr/local/data/datasets/remote/yago/yago3/2015-11-04/yagoLabels.ttl.gz', 'http://yago-knowledge.org/resource');
ld_add('/usr/local/data/datasets/remote/dblp/l3s/2015-11-04/dblp.nt.gz', 'http://dblp.l3s.de');
ld_dir_all('/usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026', '*.nt.gz', 'http://www.wikidata.org');
ld_dir_all('/usr/local/data/datasets/remote/freebase/2015-08-09', '*.nt.gz', 'http://rdf.freebase.com');
ld_dir_all('/usr/local/data/datasets/remote/linkedgeodata/2014-09-09', '*.*', 'http://linkedgeodata.org');

Our full DB.DBA.LOAD_LIST currently looks like this:

select ll_graph, ll_file from DB.DBA.LOAD_LIST;

ll_graph                              ll_file
VARCHAR                               VARCHAR NOT NULL
____________________________________

http://dblp.l3s.de                    /usr/local/data/datasets/remote/dblp/l3s/2015-11-04/dblp.nt.gz
http://dbpedia.org/resource/classes#  /usr/local/data/datasets/remote/dbpedia/2015-04/dbpedia_2015-04.owl
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/amsterdammuseum_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/article-categories_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/bbcwildlife_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/bookmashup_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/bricklink_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/category-labels_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/cordis_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/dailymed_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/dblp_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/dbpedia_2015-04.owl
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/dbtune_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/disambiguations_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/diseasome_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/drugbank_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/eunis_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/eurostat_linkedstatistics_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/eurostat_wbsg_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/external-links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/factbook_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/flickrwrappr_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/freebase-links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/gadm_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/geo-coordinates_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/geonames_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/geonames_links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/geospecies_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/gho_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/gutenberg_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/homepages_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/images_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/infobox-properties_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/infobox-property-definitions_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/instance-types-transitive_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/instance-types_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/instance_types_sdtyped-dbo_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/interlanguage-links-chapters_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/iri-same-as-uri_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/italian_public_schools_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_ar.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_es.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_it.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_ja.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_nl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_pl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_pt.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_ru.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels-en-uris_zh.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/labels_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/linkedgeodata_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/linkedmdb_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/lobid.org-manifestation.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/lobid.org-organization.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_ar.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_es.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_it.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_ja.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_nl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_pl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_pt.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_ru.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts-en-uris_zh.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/long-abstracts_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/mappingbased-properties_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/musicbrainz_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/nuts_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/nytimes_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/opencyc_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/openei_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/page-ids_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/persondata_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/revision-ids_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/revision-uris_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/revyu_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_ar.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_es.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_it.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_ja.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_nl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_pl.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_pt.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_ru.nt.gz
http://dbpedia.org 
/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts-en-uris_zh.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/short-abstracts_en.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/sider_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/skos-categories_en.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/specific-mappingbased-properties_en.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/tcm_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/transitive-redirects_en.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/transparency_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/uk-university_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/umbel_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/uscensus_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/viaf_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/wikicompany_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/wikipedia-links_en.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/wordnet_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/yago_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/yago_taxonomy.nt.gz http://dbpedia.org 
/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/yago_type_links.nt.gz http://dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/dbpedia.org/yago_types.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/article-categories_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/article-templates_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/category-labels_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/disambiguations_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/external-links_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/freebase-links_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/geo-coordinates_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/geonames_links_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/homepages_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/images_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/infobox-properties_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/infobox-property-definitions_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/instance-types-transitive_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/instance-types_de.nt.gz http://de.dbpedia.org 
/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/instance_types_lhd_dbo_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/instance_types_lhd_ext_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/interlanguage-links-chapters_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/interlanguage-links_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/iri-same-as-uri_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/labels_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/long-abstracts_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/mappingbased-properties_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/out-degree_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/page-ids_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/page-length_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/persondata_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/pnd_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/revision-ids_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/revision-uris_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/short-abstracts_de.nt.gz http://de.dbpedia.org 
/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/skos-categories_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/specific-mappingbased-properties_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/transitive-redirects_de.nt.gz http://de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/de.dbpedia.org/wikipedia-links_de.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/anchor-text_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/article-templates_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/genders_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/instance_types_dbtax-dbo.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/instance_types_dbtax_ext.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/instance_types_lhd_dbo_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/instance_types_lhd_ext_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/out-degree_en.nt.gz http://ext.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/ext.dbpedia.org/page-length_en.nt.gz http://pagelinks.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/pagelinks.dbpedia.org/page-links_en.nt.gz http://pagelinks.de.dbpedia.org /usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/pagelinks.de.dbpedia.org/page-links_de.nt.gz http://topicalconcepts.dbpedia.org 
/usr/local/data/datasets/remote/dbpedia/2015-04/importedGraphs/topicalconcepts.dbpedia.org/topical-concepts_en.nt.gz http://rdf.freebase.com /usr/local/data/datasets/remote/freebase/2015-08-09/fb2w.nt.gz http://rdf.freebase.com /usr/local/data/datasets/remote/freebase/2015-08-09/freebase-rdf-2015-08-09-00-01.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Abutters.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Abutters.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-AerialwayThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-AerialwayThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-AerowayThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-AerowayThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Amenity.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Amenity.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-BarrierThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-BarrierThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Boundary.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Boundary.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Craft.node.sorted.nt.gz http://linkedgeodata.org 
/usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Craft.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-CyclewayThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-CyclewayThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-EmergencyThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-EmergencyThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-HistoricThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-HistoricThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Leisure.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Leisure.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-LockThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-LockThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-ManMadeThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-ManMadeThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-MilitaryThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-MilitaryThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Office.node.sorted.nt.gz http://linkedgeodata.org 
/usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Office.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Place.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Place.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-PowerThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-PowerThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-PublicTransportThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-PublicTransportThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-RailwayThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-RailwayThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-RouteThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-RouteThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Shop.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-Shop.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-SportThing.node.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-SportThing.way.sorted.nt.gz http://linkedgeodata.org /usr/local/data/datasets/remote/linkedgeodata/2014-09-09/2014-09-09-ontology.sorted.nt.gz http://schema.org /usr/local/data/datasets/remote/schema.org/2015-11-04/all.nt 
http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/dbpedia-ontology.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/geonames.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/opencyc.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/same-as.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/schema.org.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/External Ontologies/wikipedia.n3 http://umbel.org/umbel /usr/local/data/datasets/remote/umbel/Ontology/umbel.n3 http://umbel.org/umbel/rc /usr/local/data/datasets/remote/umbel/Reference Structure/umbel_reference_concepts.n3 http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-instances.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-properties.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-property-taxonomy.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-simple-statements.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-sitelinks.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-statements.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-taxonomy.nt.gz http://www.wikidata.org /usr/local/data/datasets/remote/wikidata/tools.wmflabs.org/wikidata-exports/rdf/exports/20151026/wikidata-terms.nt.gz http://yago-knowledge.org/resource 
/usr/local/data/datasets/remote/yago/yago3/2015-11-04/yagoLabels.ttl.gz

219 Rows. -- 8 msec.

OK, now comes the fun (and long) part: about 1.5 hours for DBpedia alone (the new Virtuoso 7 is cool 😉), plus ~6 hours for Freebase…

Now that we have registered the files to be added, let’s finally start the process. Fire up screen if you didn’t already. (For more detailed metering than below see VirtTipsAndTricksGuideLDMeterUtility.)

sudo apt-get install screen
screen isql-vt

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- depending on the amount of CPUs and your IO performance you can run
-- more rdf_loader_run(); commands in other isql-vt sessions which will
-- speed up the import process.
-- you can watch the progress from another isql-vt session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit work;
checkpoint;
EXIT;

After this: Take a look into /var/lib/virtuoso/db/virtuoso.log and run this:

isql-vt BANNER=OFF VERBOSE=OFF 'EXEC=SELECT * FROM DB.DBA.LOAD_LIST WHERE ll_error IS NOT NULL;'

Should you find any errors in there… FIX THEM! You might be able to use the dump, but it’s incomplete in those cases. Any error aborts the loading of the corresponding file and the loader continues with the next one, so you end up with only the part of that file up to the place where the error occurred. (Should you find errors you can’t fix, please leave a comment.)

### Final polishing

You can and should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.

http://your-server:8890
login: dba
pw: dba

Go to System Admin / Packages. Install the DBpedia (v. 1.4.30) and rdf_mappers (v. 1.34.74) packages (takes about 5 minutes).
### Testing your local mirror

Go to the SPARQL endpoint of your server at http://your-server:8890/sparql (or in isql-vt prefix the query with: SPARQL)

sparql SELECT count(*) WHERE { ?s ?p ?o } ;

This shouldn’t take long in Virtuoso 7 anymore and for me now returns 849,521,186 for DBpedia (en+de) or 5,959,006,725 with all the datasets mentioned above.

I also like this query showing all the graphs and how many triples are in them:

sparql SELECT ?g COUNT(*) as ?c { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC(?c);

g  c
LONG VARCHAR  LONG VARCHAR
__________________________________________________________

http://rdf.freebase.com  3126890738
http://linkedgeodata.org  1013866920
http://www.wikidata.org  841008708
http://dbpedia.org  411914840
http://pagelinks.dbpedia.org  158878272
http://de.dbpedia.org  119876594
http://ext.dbpedia.org  99042212
http://dblp.l3s.de  81987210
http://pagelinks.de.dbpedia.org  59622795
http://yago-knowledge.org/resource  44963422
http://umbel.org/umbel/rc  480616
http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/  256065
http://topicalconcepts.dbpedia.org  157560
http://dbpedia.org/resource/classes#  28880
http://schema.org  8727
http://localhost:8890/DAV/  4806
http://www.openlinksw.com/schemas/virtrdf#  2472
http://umbel.org/umbel  1584
http://open.vocab.org/terms  1480
http://purl.org/ontology/bibo/  1226
http://purl.org/goodrelations/v1  937
http://purl.org/dc/terms/  857
http://www.openlinksw.com/schemas/opengraph  804
http://www.openlinksw.com/schemas/linkedin  741
http://www.openlinksw.com/schemas/googleplus  696
http://www.openlinksw.com/schemas/google-base  691
http://www.openlinksw.com/schemas/cv  661
virtrdf-label  638
http://xmlns.com/foaf/0.1/  557
http://rdfs.org/sioc/ns#  553
http://www.openlinksw.com/schemas/evri  482
http://www.openlinksw.com/schemas/crunchbase  444
http://bblfish.net/work/atom-owl/2006-06-06/  386
http://scot-project.org/scot/ns#  332
http://www.openlinksw.com/schemas/zillow  311
http://www.w3.org/2004/02/skos/core  252
http://www.openlinksw.com/schemas/cnet  225
http://www.openlinksw.com/schemas/tesco  183
http://www.openlinksw.com/schemas/bestbuy  172
http://www.w3.org/2002/07/owl#  160
http://www.w3.org/2002/07/owl  160
http://www.openlinksw.com/schemas/angel#  144
http://www.openlinksw.com/schemas/amazon  143
http://purl.org/dc/elements/1.1/  139
http://www.w3.org/2007/05/powder-s#  117
http://www.openlinksw.com/schemas/twitter  103
http://www.openlinksw.com/schemas/stackoverflow#  102
http://www.openlinksw.com/schemas/klout  90
http://www.w3.org/2000/01/rdf-schema#  87
http://www.w3.org/1999/02/22-rdf-syntax-ns#  85
http://www.openlinksw.com/schemas/ebay  79
http://www.openlinksw.com/schema/attribution#  68
http://www.openlinksw.com/schemas/nyt  41
http://www.openlinksw.com/schemas/wolframalpha#  32
http://www.openlinksw.com/schemas/oplbase  26
http://www.openlinksw.com/schemas/cert#  23
http://www.openlinksw.com/schemas/dbpedia-spotlight#  21
http://www.openlinksw.com/schemas/money  21
http://localhost:8890/sparql  14
http://dbpedia.org/schema/property_rules#  12
dbprdf-label  6
http://www.w3.org/ns/ldp#  3

62 Rows. -- 58092 msec.

Congratulations, you just imported nearly 850 million triples (or nearly 6 G triples for all datasets).

### Backing up this initial state

Now is a good moment to back up the whole db (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvf - /var/lib/virtuoso | lzop > virtuoso-7.1.0-DBDUMP-$(date '+%F')-dbpedia-2015-04-en_de.tar.lzop &&
/etc/init.d/virtuoso-opensource start


Afterwards you might want to repack this with xz (lzma) like this:

# apt-get install xz-utils pxz
# strip the ".lzop" suffix (including its dot) so the result is named *.tar.xz
for f in virtuoso-7.1.0-DBDUMP-*.tar.lzop ; do lzop -d -c "$f" | pxz > "${f%.lzop}.xz" ; done


Yay, done 😉
As always, feel free to leave comments if I made a mistake, or to tell us about your problems or how happy you are :D.

### Thanks

Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an OpenSource version of their DB.

• 2015-12-07: added a check for older installed versions.

# SciPy Hierarchical Clustering and Dendrogram Tutorial

This is a tutorial on how to use scipy's hierarchical clustering.

One of the benefits of hierarchical clustering is that you don't need to know the number of clusters k in your data in advance. Sadly, there doesn't seem to be much documentation on how to actually use scipy's hierarchical clustering to make an informed decision and then retrieve the clusters.

In the following I'll explain:

## Other works:

• I teach machines to associate like humans. In that project I used hierarchical clustering to group similar learned graph patterns together.
• I'm always searching for good students, contact me.

## Naming conventions:

Before we start, as I know that it's easy to get lost, some naming conventions:

• X samples (n x m array), aka data points or "singleton clusters"
• n number of samples
• m number of features
• Z cluster linkage array (contains the hierarchical clustering information)
• k number of clusters

So, let's go.

## Imports and Setup

In [1]:
# needed imports
from matplotlib import pyplot as plt
import numpy as np

In [2]:
# some setting for this notebook to actually show the graphs inline
# you probably won't need this
%matplotlib inline
np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation


## Generating Sample Data

You obviously won't need this step to run the clustering if you have your own data.

The only thing you need to make sure is that you convert your data into a matrix X with n samples and m features, so that X.shape == (n, m).
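If your data comes as a plain list of records rather than an array, a minimal sketch of getting it into that shape could look like this (the `records` values and the names `X_example` and `one_feature` are made up for illustration):

```python
import numpy as np

# hypothetical input: one record (a list of m feature values) per sample
records = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]

X_example = np.asarray(records, dtype=float)

# with only a single feature per sample you still need a 2-d array of shape (n, 1)
one_feature = np.asarray([1.0, 2.0, 4.0]).reshape(-1, 1)

n, m = X_example.shape
print(X_example.shape)    # (4, 3)
print(one_feature.shape)  # (3, 1)
```

Rows are samples and columns are features; if your array comes out as (m, n) instead, transposing it with `.T` is usually all that's needed.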

In [3]:
# generate two clusters: a with 100 points, b with 50:
np.random.seed(4711)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
print(X.shape)  # 150 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1])
plt.show()

(150, 2)


## Perform the Hierarchical Clustering

Now that we have some very simple sample data, let's do the actual clustering on it:

In [4]:
# generate the linkage matrix
from scipy.cluster.hierarchy import linkage
Z = linkage(X, 'ward')


Done. That was pretty simple, wasn't it?

Well, sure it was, this is python ;), but what does the weird 'ward' mean there and how does this actually work?

As the scipy linkage docs tell us, 'ward' is one of the methods that can be used to calculate the distance between newly formed clusters. 'ward' causes linkage() to use the Ward variance minimization algorithm.
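To get a rough feeling for what that means: for euclidean data, the 'ward' distance between two clusters A and B should equal sqrt(2 |A| |B| / (|A| + |B|)) times the distance between their centroids, so merging big clusters that are far apart is "expensive". The following is a small sketch under that assumption, not scipy's implementation; the helper `ward_distance` and the three toy points are made up for this example, and the result is only compared against linkage() on this tiny input:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def ward_distance(A, B):
    """Ward distance between two clusters given as (size, m) arrays (illustrative helper)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    centroid_gap = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    return np.sqrt(2.0 * len(A) * len(B) / (len(A) + len(B))) * centroid_gap

# three toy points: the first two are close together, the third is far away
P = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
Z_toy = linkage(P, 'ward')

# first merge: the two close singletons; second merge: that pair with the far point
first = ward_distance(P[[0]], P[[1]])
second = ward_distance(P[[0, 1]], P[[2]])
print(Z_toy[:, 2], first, second)
```

For two single samples the formula reduces to their plain euclidean distance, which is why the first merges in a linkage array look like ordinary point distances.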

I think it's a good default choice, but it never hurts to play around with some other common linkage methods like 'single', 'complete', 'average', ... and the different distance metrics like 'euclidean' (default), 'cityblock' aka Manhattan, 'hamming', 'cosine'... if you have the feeling that your data should not just be clustered to minimize the overall intra cluster variance in euclidean space. For example, you should have such a weird feeling with long (binary) feature vectors (e.g., word-vectors in text clustering).

As you can see there's a lot of choice here and while python and scipy make it very easy to do the clustering, it's you who has to understand and make these choices. If I find the time, I might give some more practical advice about this, but for now I'd urge you to at least read up on the methods and metrics mentioned above to make a somewhat informed choice.

Another thing you can and should definitely do is check the Cophenetic Correlation Coefficient of your clustering with the help of the cophenet() function. This (very very briefly) compares (correlates) the actual pairwise distances of all your samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, which in our case is pretty close:

In [5]:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(X))
c

Out[5]:
0.98001483875742679
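As the coefficient is cheap to compute, one way to explore the choices mentioned above is to loop over a few linkage methods and compare their values. A minimal sketch on freshly generated toy data (the names `X_demo` and `scores` exist only in this example, and a higher coefficient alone doesn't prove that a clustering is more useful for your task):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# toy data similar to the tutorial's: two gaussian blobs in 2 dimensions
rng = np.random.RandomState(0)
X_demo = np.concatenate((
    rng.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=100),
    rng.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=50),
))

orig_dists = pdist(X_demo)  # condensed matrix of pairwise euclidean distances

scores = {}
for method in ['ward', 'single', 'complete', 'average']:
    Z_demo = linkage(X_demo, method)  # the metric defaults to 'euclidean'
    c, _ = cophenet(Z_demo, orig_dists)
    scores[method] = c

# print the methods, best preserved distances first
for method, c in sorted(scores.items(), key=lambda kv: -kv[1]):
    print('%-8s %.4f' % (method, c))
```

To try another metric, pass it to both calls, for example `linkage(X_demo, 'average', metric='cityblock')` together with `pdist(X_demo, 'cityblock')`; 'ward' itself is only defined for euclidean distances.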

No matter what method and metric you pick, the linkage() function will use that method and metric to calculate the distances of the clusters (starting with your n individual samples (aka data points) as singleton clusters) and in each iteration will merge the two clusters which have the smallest distance according to the selected method and metric. It will return an array of length n - 1 giving you information about the n - 1 cluster merges which it needs to pairwise merge n clusters. Z[i] (counting from 0) tells us which clusters were merged in iteration i + 1, so let's take a look at the first two points that were merged:

In [6]:
Z[0]

Out[6]:
array([ 52.     ,  53.     ,   0.04151,   2.     ])

We can see that each row of the resulting array has the format [idx1, idx2, dist, sample_count].

In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151. This created a cluster with a total of 2 samples.

In [7]:
Z[1]

Out[7]:
array([ 14.     ,  79.     ,   0.05914,   2.     ])

In the second iteration the algorithm decided to merge the clusters (original samples here as well) with indices 14 and 79, which had a distance of 0.05914. This again formed another cluster with a total of 2 samples.

The indices of the clusters until now correspond to our samples. Remember that we had a total of 150 samples, so indices 0 to 149. Let's have a look at the first 20 iterations:

In [8]:
Z[:20]

Out[8]:
array([[  52.     ,   53.     ,    0.04151,    2.     ],
[  14.     ,   79.     ,    0.05914,    2.     ],
[  33.     ,   68.     ,    0.07107,    2.     ],
[  17.     ,   73.     ,    0.07137,    2.     ],
[   1.     ,    8.     ,    0.07543,    2.     ],
[  85.     ,   95.     ,    0.10928,    2.     ],
[ 108.     ,  131.     ,    0.11007,    2.     ],
[   9.     ,   66.     ,    0.11302,    2.     ],
[  15.     ,   69.     ,    0.11429,    2.     ],
[  63.     ,   98.     ,    0.1212 ,    2.     ],
[ 107.     ,  115.     ,    0.12167,    2.     ],
[  65.     ,   74.     ,    0.1249 ,    2.     ],
[  58.     ,   61.     ,    0.14028,    2.     ],
[  62.     ,  152.     ,    0.1726 ,    3.     ],
[  41.     ,  158.     ,    0.1779 ,    3.     ],
[  10.     ,   83.     ,    0.18635,    2.     ],
[ 114.     ,  139.     ,    0.20419,    2.     ],
[  39.     ,   88.     ,    0.20628,    2.     ],
[  70.     ,   96.     ,    0.21931,    2.     ],
[  46.     ,   50.     ,    0.22049,    2.     ]])

We can observe that until iteration 13 the algorithm only directly merged original samples. We can also observe the monotonic increase of the distance.

In iteration 14 the algorithm decided to merge cluster indices 62 with 152. If you paid attention the 152 should astonish you as we only have original sample indices 0 to 149 for our 150 samples. All indices idx >= len(X) actually refer to the cluster formed in Z[idx - len(X)].

This means that while idx 149 corresponds to X[149], idx 150 corresponds to the cluster formed in Z[0], idx 151 to Z[1], 152 to Z[2], ...

Hence, merge iteration 14 merged sample 62 with our samples 33 and 68, which were previously merged in iteration 3, corresponding to Z[2] (152 - 150).
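
The index rule can be sketched in a few lines of plain Python. This is only an illustration: the helper name leaves() is made up here (it is not a scipy function), and Z_head just repeats the first 15 rows of Z as printed above.

```python
# First 15 rows of Z as printed above: [idx1, idx2, dist, sample_count]
Z_head = [
    [52, 53, 0.04151, 2],
    [14, 79, 0.05914, 2],
    [33, 68, 0.07107, 2],
    [17, 73, 0.07137, 2],
    [1, 8, 0.07543, 2],
    [85, 95, 0.10928, 2],
    [108, 131, 0.11007, 2],
    [9, 66, 0.11302, 2],
    [15, 69, 0.11429, 2],
    [63, 98, 0.1212, 2],
    [107, 115, 0.12167, 2],
    [65, 74, 0.1249, 2],
    [58, 61, 0.14028, 2],
    [62, 152, 0.1726, 3],
    [41, 158, 0.1779, 3],
]
N_SAMPLES = 150  # len(X) in this tutorial


def leaves(Z, idx, n):
    """Resolve cluster index `idx` into the original sample indices it contains.

    Hypothetical helper for illustration, not part of scipy.
    """
    if idx < n:  # indices below n are original samples
        return [idx]
    row = Z[idx - n]  # indices >= n refer to the cluster formed in Z[idx - n]
    return sorted(leaves(Z, int(row[0]), n) + leaves(Z, int(row[1]), n))


# iteration 14 is Z[13] and forms the cluster with index 150 + 13 = 163
print(leaves(Z_head, 163, N_SAMPLES))
# iteration 15 is Z[14] and forms the cluster with index 150 + 14 = 164
print(leaves(Z_head, 164, N_SAMPLES))
```

The first call resolves to the samples 33, 62 and 68, the second one to 15, 41 and 69.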

Let's check out the points' coordinates to see if this makes sense:

In [9]:
X[[33, 68, 62]]

Out[9]:
array([[ 9.83913, -0.4873 ],
[ 9.89349, -0.44152],
[ 9.97793, -0.56383]])

Seems pretty close, but let's plot the points again and highlight them:

In [10]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])  # plot all points
plt.scatter(X[idxs,0], X[idxs,1], c='r')  # plot interesting points in red again
plt.show()


We can see that the 3 red dots are pretty close to each other, which is a good thing.

The same happened in iteration 15 where the algorithm merged index 41 with 15 and 69:

In [11]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[idxs,0], X[idxs,1], c='r')
idxs = [15, 69, 41]
plt.scatter(X[idxs,0], X[idxs,1], c='y')
plt.show()


Showing that the 3 yellow dots are also quite close.

And so on...

We'll later come back to visualizing this, but now let's have a look at what's called a dendrogram of this hierarchical clustering first:

## Plotting a Dendrogram

A dendrogram is a visualization in form of a tree showing the order and distances of merges during the hierarchical clustering.

In [12]:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90.,  # rotates the x axis labels
leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()


(right click and "View Image" to see full resolution)

If this is the first time you see a dendrogram, it's probably quite confusing, so let's take this apart...

• On the x axis you see labels. If you don't specify anything else they are the indices of your samples in X.
• On the y axis you see the distances (of the 'ward' method in our case).

Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster. You can find that other cluster by following the other vertical line down again. If you don't encounter another horizontal line, it was just merged with the other label you reach, otherwise it was merged into another cluster that was formed earlier.

Summarizing:

• horizontal lines are cluster merges
• vertical lines tell you which clusters/labels were part of the merge forming that new cluster
• heights of the horizontal lines tell you about the distance that needed to be "bridged" to form the new cluster

You can also see that from distances > 25 up there's a huge jump of the distance to the final merge at a distance of approx. 180. Let's have a look at the distances of the last 4 merges:

In [13]:
Z[-4:,2]

Out[13]:
array([  15.11533,   17.11527,   23.12199,  180.27043])

Such distance jumps / gaps in the dendrogram are pretty interesting for us. They indicate that something is merged here, that maybe just shouldn't be merged. In other words: maybe the things that were merged here really don't belong to the same cluster, telling us that maybe there's just 2 clusters here.
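
To make the gap explicit, here is a tiny sketch in plain Python. It only uses the four distances printed in Out[13]; the variable names are made up for this illustration, and it's meant to show the arithmetic, not to be an automated rule for picking the number of clusters.

```python
# distances of the last 4 merges, copied from Out[13] above
last_dists = [15.11533, 17.11527, 23.12199, 180.27043]

# gap between each merge and the one before it
gaps = [b - a for a, b in zip(last_dists, last_dists[1:])]
for d, gap in zip(last_dists[1:], gaps):
    print("merge at %9.5f is %9.5f above the previous one" % (d, gap))

# cutting inside the biggest gap leaves every merge above it undone,
# and each undone merge adds one cluster
biggest = gaps.index(max(gaps))
merges_above_gap = len(gaps) - biggest
print("clusters when cutting in the biggest gap: %d" % (merges_above_gap + 1))
```

The biggest gap (about 157) sits right below the final merge, so cutting there undoes only that one merge and leaves 2 clusters.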

Looking at the indices in the above dendrogram also shows us that the green cluster only has indices >= 100, while the red one only has indices < 100. This is a good thing as it shows that the algorithm re-discovered the two classes in our toy example.

In case you're wondering where the colors come from, you might want to have a look at the color_threshold argument of dendrogram(), which, as it was not specified, automagically picked a distance cut-off value of 70 % of the final merge and then colored the first clusters below that in individual colors.
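
As a quick worked example of that 70 % rule (plain arithmetic, with the final merge distance copied from Out[13]; the variable names are just for illustration):

```python
final_merge_distance = 180.27043  # last entry of Z[:, 2], see Out[13] above
default_color_threshold = 0.7 * final_merge_distance
print("default color_threshold: %.2f" % default_color_threshold)
```

The two sub-trees joined by the final merge were both formed at distances of at most 23.12, which is far below this threshold, so each of them gets its own color.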

### Dendrogram Truncation

As you might have noticed, the above is pretty big for 150 samples already and you probably have way more in real scenarios, so let me spend a few seconds on highlighting some other features of the dendrogram() function:

In [14]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
truncate_mode='lastp',  # show only the last p merged clusters
p=12,  # show only the last p merged clusters
show_leaf_counts=False,  # otherwise numbers in brackets are counts
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()


The above shows a truncated dendrogram, which only shows the last p=12 out of our 149 merges.

First thing you should notice is that most labels are missing. This is because except for X[40] all other samples were already merged into clusters before the last 12 merges.

The parameter show_contracted allows us to draw black dots at the heights of those previous cluster merges, so we can still spot gaps even if we don't want to clutter the whole visualization. In our example we can see that the dots are all at pretty small distances when compared to the huge last merge at a distance of 180, telling us that we probably didn't miss much there.

As it's kind of hard to keep track of the cluster sizes just by the dots, dendrogram() will by default also print the cluster sizes in brackets () if a cluster was truncated:

In [15]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
Z,
truncate_mode='lastp',  # show only the last p merged clusters
p=12,  # show only the last p merged clusters
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()


We can now see that the right most cluster already consisted of 33 samples before the last 12 merges.

### Eye Candy

Even though this already makes for quite a nice visualization, we can pimp it even more by also annotating the distances inside the dendrogram by using some of the useful return values of dendrogram():

In [16]:
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)

    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata

In [17]:
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,  # useful in small plots so annotations don't overlap
)
plt.show()


## Selecting a Distance Cut-Off aka Determining the Number of Clusters

As explained above already, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If you have the chance to do this manually, i'd always opt for that, as it allows you to gain some insights into your data and to perform some sanity checks on the edge cases. In our case i'd probably just say that our cut-off is 50, as the jump is pretty obvious:

In [18]:
# set cut-off to 50
max_d = 50  # max_d as in max_distance


Let's visualize this in the dendrogram as a cut-off line:

In [19]:
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=max_d,  # plot a horizontal cut-off line
)
plt.show()


As we can see, we ("surprisingly") have two clusters at this cut-off.

In general for a chosen cut-off value max_d you can always simply count the number of intersections with vertical lines of the dendrogram to get the number of formed clusters. Say we choose a cut-off of max_d = 16, we'd get 4 final clusters:

In [20]:
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=16,
)
plt.show()
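
Counting intersections with the cut-off line boils down to counting the merges above it: every merge above the line stays undone and each undone merge adds one cluster. A minimal sketch in plain Python, assuming merge distances that never decrease from one merge to the next (as in our 'ward' example). The helper clusters_at() is made up for this illustration, and as it only gets the last 4 merge distances from Out[13], it's only valid for cut-offs of at least 15.11533:

```python
def clusters_at(merge_dists, max_d):
    """Number of flat clusters when cutting a dendrogram at max_d.

    Hypothetical helper: 1 cluster if nothing is cut, plus 1 for every
    merge that lies above the cut-off.
    """
    return 1 + sum(1 for d in merge_dists if d > max_d)


last_dists = [15.11533, 17.11527, 23.12199, 180.27043]  # from Out[13]
for max_d in (50, 16):
    print("max_d = %d -> %d clusters" % (max_d, clusters_at(last_dists, max_d)))
```

This gives 2 clusters for max_d = 50 and 4 clusters for max_d = 16, matching what we counted in the dendrograms.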


### Automated Cut-Off Selection (or why you shouldn't rely on this)

Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which you want to automate this.

The problem again is that there is no golden method to pick the number of clusters for all cases (which is why i think the investigative & backtesting manual method is preferable). Wikipedia lists a couple of common methods. Reading this, you should realize how different the approaches and how vague their descriptions are.

I honestly think it's a really bad idea to just use any of those methods, unless you know the data you're working on really really well.

#### Inconsistency Method

For example, let's have a look at the "inconsistency" method, which is the default criterion of the fcluster() function in scipy.

The question driving the inconsistency method is "what makes a distance jump a jump?". It answers this by comparing each cluster merge's height h to the average avg and normalizing it by the standard deviation std formed over the depth previous levels:

$$inconsistency = \frac{h - avg}{std}$$

The following shows a matrix of the avg, std, count, inconsistency for each of the last 10 merges of our hierarchical clustering with depth = 5:

In [21]:
from scipy.cluster.hierarchy import inconsistent

depth = 5
incons = inconsistent(Z, depth)
incons[-10:]

Out[21]:
array([[  1.80875,   2.17062,  10.     ,   2.44277],
[  2.31732,   2.19649,  16.     ,   2.52742],
[  2.24512,   2.44225,   9.     ,   2.37659],
[  2.30462,   2.44191,  21.     ,   2.63875],
[  2.20673,   2.68378,  17.     ,   2.84582],
[  1.95309,   2.581  ,  29.     ,   4.05821],
[  3.46173,   3.53736,  28.     ,   3.29444],
[  3.15857,   3.54836,  28.     ,   3.93328],
[  4.9021 ,   5.10302,  28.     ,   3.57042],
[ 12.122  ,  32.15468,  30.     ,   5.22936]])

Now you might be tempted to say "yay, let's just pick 5" as a limit in the inconsistencies, but look at what happens if we set depth to 3 instead:

In [22]:
depth = 3
incons = inconsistent(Z, depth)
incons[-10:]

Out[22]:
array([[  3.63778,   2.55561,   4.     ,   1.35908],
[  3.89767,   2.57216,   7.     ,   1.54388],
[  3.05886,   2.66707,   6.     ,   1.87115],
[  4.92746,   2.7326 ,   7.     ,   1.39822],
[  4.76943,   3.16277,   6.     ,   1.60456],
[  5.27288,   3.56605,   7.     ,   2.00627],
[  8.22057,   4.07583,   7.     ,   1.69162],
[  7.83287,   4.46681,   7.     ,   2.07808],
[ 11.38091,   6.2943 ,   7.     ,   1.86535],
[ 37.25845,  63.31539,   7.     ,   2.25872]])

Oops! This should make you realize that the inconsistency values heavily depend on the depth of the tree you calculate the averages over.
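
To connect the formula with the numbers, the inconsistency of the final merge can be recomputed by hand. The sketch below is plain Python, the function name is made up, and all values are copied from the outputs above (h from Out[13], avg and std from the last rows of Out[21] and Out[22]):

```python
def inconsistency(h, avg, std):
    """(h - avg) / std, as in the formula above."""
    return (h - avg) / std


h_final = 180.27043  # height of the final merge, see Out[13]

# avg and std of the final merge for depth = 5 (last row of Out[21])
print("depth=5: %.4f" % inconsistency(h_final, 12.122, 32.15468))
# avg and std of the final merge for depth = 3 (last row of Out[22])
print("depth=3: %.4f" % inconsistency(h_final, 37.25845, 63.31539))
```

Up to rounding this reproduces the 5.22936 and 2.25872 from the last column: the very same merge looks less than half as "inconsistent" once depth changes from 5 to 3.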

Another problem in its calculation is that the previous d levels' heights aren't normally distributed, but expected to increase, so you can't really just treat the current level as an "outlier" of a normal distribution, as it's expected to be bigger.

#### Elbow Method

Another thing you might see out there is a variant of the "elbow method". It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the blue line graph below, which is the highest value of the green graph below):

In [23]:
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print("clusters: %d" % k)

clusters: 2


While this works nicely in our simplistic example (the green line takes its maximum for k=2), it's pretty flawed as well.

One issue of this method has to do with the way an "elbow" is defined: you need at least a right and a left point, which implies that this method will never be able to tell you that all your data is in one single cluster only.
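
A small plain Python sketch of what the second difference does. The helper second_diff() is made up for this illustration and mimics np.diff(values, 2) for a flat list; it only gets the last 4 merge distances from Out[13] instead of the last 10 used above:

```python
def second_diff(values):
    """Discrete 2nd derivative, like np.diff(values, 2) for a 1-d sequence."""
    return [values[i + 2] - 2 * values[i + 1] + values[i]
            for i in range(len(values) - 2)]


last = [15.11533, 17.11527, 23.12199, 180.27043]  # from Out[13]
acceleration_rev = second_diff(last)[::-1]
print(["%.5f" % a for a in acceleration_rev])

# the index of the maximum is >= 0, so k can never be smaller than 2
k = acceleration_rev.index(max(acceleration_rev)) + 2
print("clusters: %d" % k)
```

With 4 distances there are only 2 acceleration values (about 151.14 and 4.01), and even the best case, index 0, already means k = 2.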

Another problem with this variant lies in the np.diff(Z[:, 2], 2) though. The order of the distances in Z[:, 2] isn't properly reflecting the order of merges within one branch of the tree. In other words: there is no guarantee that the distance of Z[i] is contained in the branch of Z[i+1]. By simply computing the np.diff(Z[:, 2], 2) we assume that this doesn't matter and just compare distance jumps from different branches of our merge tree.

If you still don't want to believe this, let's just construct another simplistic example but this time with very different variances in the different clusters:

In [24]:
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
plt.show()


As you can see we have 5 clusters now, but they have increasing variances... let's have a look at the dendrogram again and how you can use it to spot the problem:

In [25]:
Z2 = linkage(X2, 'ward')
plt.figure(figsize=(10,10))
fancy_dendrogram(
Z2,
truncate_mode='lastp',
p=30,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=40,
max_d=170,
)
plt.show()


When looking at a dendrogram like this and trying to put a cut-off line somewhere, you should notice the very different distributions of merge distances below that cut-off line. Compare the distribution in the cyan cluster to the red, green or even two blue clusters that have even been truncated away. In the cyan cluster below the cut-off we don't really have any discontinuity of merge distances up to very close to the cut-off line. The two blue clusters on the other hand are each merged below a distance of 25, and have a gap of > 145 to our cut-off line.

The variant of the "elbow" method will incorrectly see the jump from 167 to 180 as minimal and tell us we have 4 clusters:

In [26]:
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print("clusters: %d" % k)

clusters: 4


The same happens with the inconsistency metric:

In [27]:
print(inconsistent(Z2, 5)[-10:])

[[  13.99222   15.56656   30.         3.86585]
[  16.73941   18.5639    30.         3.45983]
[  19.05945   20.53211   31.         3.49953]
[  19.25574   20.82658   29.         3.51907]
[  21.36116   26.7766    30.         4.50256]
[  36.58101   37.08602   31.         3.50761]
[  12.122     32.15468   30.         5.22936]
[  42.6137   111.38577   31.         5.13038]
[  81.75199  208.31582   31.         5.30448]
[ 147.25602  307.95701   31.         3.6215 ]]


I hope you can now understand why i'm warning against blindly using any of those methods on a dataset you know nothing about. They can give you some indication, but you should always go back in and check if the results make sense, for example with a dendrogram which is a great tool for that (especially if you have higher dimensional data that you can't simply visualize anymore).

## Retrieve the Clusters

Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.

### Knowing max_d:

Let's say we determined the max distance with help of a dendrogram, then we can do the following to get the cluster id for each of our samples:

In [28]:
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
clusters

Out[28]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
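
If you wonder what cutting at a distance means in terms of the merge table, here is a rough sketch in plain Python. It is not how scipy implements fcluster(): flat_clusters() and Z_toy are made up for this illustration, the sketch assumes that merge distances never decrease from row to row (true for 'ward'), and its label numbering (order of first appearance) need not match the one of fcluster().

```python
def flat_clusters(Z, n, max_d):
    """Label each of the n samples after cutting the merge tree at max_d."""
    parent = list(range(n + len(Z)))  # one slot per sample and per merged cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for step, (a, b, dist, _count) in enumerate(Z):
        if dist <= max_d:  # only merges at or below the cut-off are kept
            new_id = n + step  # index of the cluster formed in Z[step]
            parent[find(int(a))] = new_id
            parent[find(int(b))] = new_id

    labels, seen = [], {}
    for sample in range(n):
        root = find(sample)
        labels.append(seen.setdefault(root, len(seen) + 1))
    return labels


# toy linkage for 5 samples: [idx1, idx2, dist, sample_count]
Z_toy = [
    [0, 1, 0.5, 2],
    [2, 3, 0.7, 2],
    [5, 6, 3.0, 4],
    [4, 7, 10.0, 5],
]
print(flat_clusters(Z_toy, 5, max_d=5.0))
print(flat_clusters(Z_toy, 5, max_d=0.6))
```

With max_d=5.0 only the last merge stays undone, so samples 0 to 3 share one label and sample 4 gets its own. With max_d=0.6 only the very first merge is kept.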

### Knowing k:

Another way starting from the dendrogram is to say "i can see i have k=2" clusters. You can then use:

In [29]:
k=2
fcluster(Z, k, criterion='maxclust')

Out[29]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

### Using the Inconsistency Method (default):

If you're really sure you want to use the inconsistency method to determine the number of clusters in your dataset, you can use the default criterion of fcluster() and hope you picked the correct values:

In [30]:
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)

Out[30]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

If you're lucky enough and your data is very low dimensional, you can actually visualize the resulting clusters very easily:

In [31]:
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism')  # plot points with cluster dependent colors
plt.show()


I hope you enjoyed this tutorial. Feedback welcome ;)

# DBpedia 2014 Stats – Top Subjects, Predicates and Objects

Ever wondered what the top subjects / predicates / objects are in DBpedia?

I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD.

Turns out this is actually more difficult than i expected. Mostly due to the fact that quad stores don’t optimize for such queries. This means that you can’t just ask a SPARQL endpoint (not even your local one) to give you the top subjects, predicates or objects with a query like this:

select ?n count(*) as ?c
where {
?n ?p ?o.
}
order by desc(?c)
limit 10


Try yourself here if you don’t believe me… (i set it to time out after 15 seconds and it will return quite a dangerously nonsensical result if you’re not aware that you might get partial answers).

### Some Rant

So this led me to the fascinating conclusion that our beloved RDF query language doesn’t even allow us to answer simple questions such as “which node is most often used as a subject / predicate / object?” (we’re talking with a single SPARQL endpoint here, don’t even try dragging me into an open/closed world assumption discussion, …).

So, all is great, let’s just not ask those evil questions…

… said no (computer) scientist ever.

So let’s get our hands dirty and use some unix tool magic…

## Working with Dumps in NT Format

Luckily, I already had all the dumps laid out locally as described here, and lucky again, they are in N-Triples format.

N-Triples is a line based format, which means we have exactly one triple per line. I don’t exactly know whom to thank for this, but should you ever read this (wait, why are you reading my blog?) THANK YOU. It means that neither subject nor predicate nor object can contain (unescaped) newlines. And this means that you can actually quite sanely sort and parse .nt files with standard unix tools that have been optimized by generations of smart people.

I think you see where this is going: a good old bash one-liner with grep, cut, sort and uniq, by far the fastest tools i know for the job.

## A Word about Sort Orders and Locales

Sort orders depend on your locale! This means that files sorted with a locale such as en_US.UTF-8 are not properly sorted for someone with a locale such as de_DE.UTF-8. Hence it’s wise to always run this in a shell before working with sort:

export LC_ALL=C


It resets your locale to a classic C byte-wise one, having the nice side effect that it’s faster as well.
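
To get a feeling for what a byte-wise order looks like, here is a small Python sketch. It only imitates the comparison that sort uses under LC_ALL=C by sorting on the UTF-8 bytes of each string; the word list is made up for this illustration.

```python
# "\u00c9clair" is "Eclair" with an acute accent on the capital E,
# "\u00e9clair" the same word with a lowercase accented e
words = ["zebra", "Zoo", "apple", "\u00c9clair", "\u00e9clair"]

# byte-wise order: all ASCII uppercase letters come before all ASCII
# lowercase letters, and non-ASCII characters come after both
bytewise = sorted(words, key=lambda w: w.encode("utf-8"))
print(bytewise)
```

Here "Zoo" ends up before "apple" and both accented words end up after "zebra". A locale-aware sort would typically order these differently (e.g. "apple" before "Zoo"), which is exactly why a file sorted in one locale doesn’t count as sorted in another.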

## Deduplication

First, it turns out the DBpedia dumps actually contain quite an astonishing amount of duplicate triples. This is not a problem if loaded into a quad store as they’ll just count once, but for counting them like we will, it is a problem.

To split them apart let’s do the following: we pick up all the dump files that are loaded into our endpoint with pv, a handy little tool similar to cat that also shows a nice progress bar. Then we decompress with zcat, remove comments from the files with grep and then call sort. We tell sort to use a ton of RAM (32 GB), but not even that is enough for the > 80 GB of decompressed dumps, so we need temp files. We can direct sort to put them onto an SSD instead of into /tmp as it does by default, and we can also compress those temp files on the fly. lzop is a very fast compression tool and the perfect fit for this (not compressing the temp files actually degrades performance even at the 300 MB/s write speed of the SSD!). After this we use tee to multiplex our stream into two channels: one goes through a plain uniq and is gzipped with pigz (like gzip but parallel, as gzipping > 80 GB becomes quite the bottleneck here otherwise) into dbpedia_uniq.nt.gz, the other one goes through uniq -c -d, which only outputs the duplicate lines together with their counts, and is gzipped (single threaded is OK here, as it’s not sooo big) into dbpedia_dups.txt.gz.

pv /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/{dbpedia.org,ext.dbpedia.org,pagelinks.dbpedia.org,topicalconcepts.dbpedia.org}/* |
zcat |  # decompress
grep -v -E '^\s*#' |  # ignore comments in the nt files
sort -S32G -T/ssd/tmp/ --compress-program=lzop |
tee \
>( uniq | pigz > dbpedia_uniq.nt.gz ) \
>( uniq -c -d | gzip > dbpedia_dups.txt.gz ) \
>/dev/null


As you can see from the first line, I include the external, pagelinks and topicalconcepts datasets, but the process is really the same no matter what.

After ~ 10 minutes we’re left with a 6.5 GB dbpedia_uniq.nt.gz (547,084,682 unique triples) and a 238 MB dbpedia_dups.txt.gz.
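
For readers who prefer reading Python over shell, the logic of the pipeline can be sketched like this. It is only meant to show what the grep, sort, uniq and uniq -c -d steps compute: the function name and the sample lines are made up, and as it sorts everything in memory it is no replacement for the pipeline on > 80 GB of data.

```python
import re
from itertools import groupby

COMMENT = re.compile(r'^\s*#')  # same pattern as in the grep call


def dedup(lines):
    """Return (unique lines, [(count, line) for lines occurring more than once])."""
    triples = sorted(l for l in lines if not COMMENT.match(l))
    uniq, dups = [], []
    for line, group in groupby(triples):  # equal lines are adjacent after sorting
        count = sum(1 for _ in group)
        uniq.append(line)  # what `uniq` prints
        if count > 1:
            dups.append((count, line))  # what `uniq -c -d` prints
    return uniq, dups


sample = [
    "<b> <p> <o> .",
    "# a comment line",
    "<a> <p> <o> .",
    "<b> <p> <o> .",
]
uniq, dups = dedup(sample)
print(uniq)
print(dups)
```

In the sample the comment is dropped, both distinct triples survive once, and only the triple starting with `<b>` is reported as a duplicate with a count of 2.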

## Top Duplicates

The top duplicates as acquired with

zcat dbpedia_dups.txt.gz | sort -n -r -S8G | head -n20


are (full file (238 MB)):

   4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> .
4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
784 <http://commons.wikimedia.org/wiki/Special:FilePath/Illinois_-_outline_map.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Illinois_-_outline_map.svg> .


## Getting S,P,O Counts

OK, now let’s count the subject, predicate and object occurrences.
Subjects, predicates and objects are delimited with a single space (" "), everything else in the line we just count as an object (so we just count the final " ." to the object).
Similar to the above pipeline, we use tee again to multiplex the stream into three pipelines for subject, predicate and object counts.
Each of them is mostly based on cut, first to get the fields (-f1 for subject, -f2 predicate, -f3- object), then for limiting very long strings to only the first 1024 chars. While this actually introduces some false positive matches for long literals, it’s probably safe for URIs, and reduces sort times and file sizes for the object chunk a lot. If you want very accurate counts you should probably re-run without the cut -c-1024 lines.
Afterwards in each pipeline the occurrences of a node in the s,p,o positions are sorted and counted with uniq -c, then gzipped with pigz.

pv dbpedia_uniq.nt.gz |
zcat |
tee \
>( cut -f1 -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c |
pigz > dbpedia_1_subject_counts.txt.gz ) \
>( cut -f2 -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c |
pigz > dbpedia_2_predicate_counts.txt.gz ) \
>( cut -f3- -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c | pigz > dbpedia_3_object_counts.txt.gz ) \
>/dev/null
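
The splitting logic of the three cut calls can be sketched in Python as follows. Again this is just an in-memory illustration with a made up function name and made up sample triples; it is not what was actually run on the dumps.

```python
from collections import Counter


def spo_counts(lines, max_len=1024):
    """Count how often each node occurs in subject, predicate and object position."""
    subjects, predicates, objects = Counter(), Counter(), Counter()
    for line in lines:
        # split at the first two spaces only: the rest of the line
        # (including the final " .") counts as the object, like cut -f3-
        subj, pred, obj = line.rstrip("\n").split(" ", 2)
        subjects[subj[:max_len]] += 1  # truncate like cut -c-1024
        predicates[pred[:max_len]] += 1
        objects[obj[:max_len]] += 1
    return subjects, predicates, objects


sample = [
    '<s1> <p1> "a literal with spaces" .',
    '<s1> <p2> <o1> .',
    '<s2> <p1> <o1> .',
]
s, p, o = spo_counts(sample)
print(s.most_common(1), p.most_common(1), o.most_common(1))
```

Note how the literal with spaces stays in one piece, as only the first two spaces of a line are used as delimiters.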


After 15 minutes we’re left with 3 files:
dbpedia_1_subject_counts.txt.gz (214M), dbpedia_2_predicate_counts.txt.gz (387K), dbpedia_3_object_counts.txt.gz (1.9G)

As expected there are only relatively few different predicates and the objects actually take up quite a lot of data.

Before getting the tops it’s quite useful to exclude subjects and objects that occur less than 10 times with awk, which greatly reduces the filesizes and subsequent sort times:

zcat dbpedia_1_subject_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_1_subject_counts_o9.txt.gz
zcat dbpedia_3_object_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_3_object_counts_o9.txt.gz


As we can see from the size reduction already there’s actually way more objects occurring less than 10 times than subjects.

Similar to before the tops can be acquired with something like this:

for f in dbpedia_1_subject_counts_o9.txt.gz dbpedia_2_predicate_counts.txt.gz dbpedia_3_object_counts_o9.txt.gz ; do
zcat $f | sort -n -r | pigz >${f%.txt.gz}_tops.txt.gz
done
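
The awk filter and the final sort -n -r can be sketched in Python like this. The function name is made up, the example.org line is a hypothetical rare node added to show the filter, and the other lines are taken from the top subjects listed in this post.

```python
def top_counts(count_lines, min_count=10, limit=None):
    """Parse `uniq -c` style lines ("<count> <node>"), most frequent first."""
    parsed = []
    for line in count_lines:
        count, node = line.strip().split(None, 1)  # count, then the rest of the line
        if int(count) >= min_count:  # same as awk '$1 > 9' for the default
            parsed.append((int(count), node))
    parsed.sort(key=lambda pair: pair[0], reverse=True)  # like sort -n -r
    return parsed[:limit]


sample = [
    "   5712 <http://dbpedia.org/resource/2013_in_film>",
    "      3 <http://example.org/hypothetical_rare_node>",
    "   8118 <http://dbpedia.org/resource/Alphabetical_list_of_communes_of_Italy>",
    "   7110 <http://dbpedia.org/resource/List_of_places_in_Afghanistan>",
]
for count, node in top_counts(sample, limit=2):
    print(count, node)
```

The rare node is filtered out before sorting, which is the whole point of the awk step: the less there is to sort, the faster it gets.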


So here they are, the …

## Top 100 Subjects:

   8118 <http://dbpedia.org/resource/Alphabetical_list_of_communes_of_Italy>
7110 <http://dbpedia.org/resource/List_of_places_in_Afghanistan>
5857 <http://dbpedia.org/resource/List_of_populated_places_in_Bosnia_and_Herzegovina>
5712 <http://dbpedia.org/resource/2013_in_film>
5550 <http://dbpedia.org/resource/List_of_municipalities_of_Brazil>
5458 <http://dbpedia.org/resource/List_of_dialling_codes_in_Germany>
5405 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Plantae)>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_3_of_4>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_2_of_4>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_1_of_4>
5182 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Animalia)>
5152 <http://dbpedia.org/resource/Index_of_India-related_articles>
5090 <http://dbpedia.org/resource/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States>
5068 <http://dbpedia.org/resource/List_of_Social_Democratic_Party_of_Germany_members>
4942 <http://dbpedia.org/resource/List_of_painters_in_the_Web_Gallery_of_Art>
4873 <http://dbpedia.org/resource/List_of_stage_names>
4829 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_4_of_4>
4795 <http://dbpedia.org/resource/List_of_Harvard_University_people>
4743 <http://dbpedia.org/resource/List_of_OMIM_disorder_codes>
4726 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia>
4698 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia_(alphabetic)>
4690 <http://dbpedia.org/resource/Index_of_philosophy_articles_(I%E2%80%93Q)>
4603 <http://dbpedia.org/resource/List_of_molluscan_genera_represented_in_the_fossil_record>
4493 <http://dbpedia.org/resource/List_of_American_television_programs_by_date>
4457 <http://dbpedia.org/resource/List_of_biographical_films>
4443 <http://dbpedia.org/resource/List_of_brachiopod_genera>
4355 <http://dbpedia.org/resource/List_of_English_writers>
4345 <http://dbpedia.org/resource/List_of_composers_by_name>
4341 <http://dbpedia.org/resource/List_of_historical_German_and_Czech_names_for_places_in_the_Czech_Republic>
4329 <http://dbpedia.org/resource/2012_in_film>
4275 <http://dbpedia.org/resource/List_of_people_from_Illinois>
4219 <http://dbpedia.org/resource/List_of_people_from_Texas>
4218 <http://dbpedia.org/resource/List_of_village_development_committees_of_Nepal>
4194 <http://dbpedia.org/resource/List_of_postal_codes_in_Portugal>
4159 <http://dbpedia.org/resource/IUCN_Red_List_data_deficient_species_(Chordata)>
4140 <http://dbpedia.org/resource/List_of_trilobite_genera>
4137 <http://dbpedia.org/resource/List_of_aircraft_engines>
4130 <http://dbpedia.org/resource/List_of_moths_of_Taiwan>
4084 <http://dbpedia.org/resource/List_of_flora_of_the_Sonoran_Desert_Region_by_common_name>
3992 <http://dbpedia.org/resource/List_of_film_score_composers>
3984 <http://dbpedia.org/resource/List_of_marine_gastropod_genera_in_the_fossil_record>
3930 <http://dbpedia.org/resource/List_of_performances_on_Top_of_the_Pops>
3886 <http://dbpedia.org/resource/List_of_gliders>
3873 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Romania>
3839 <http://dbpedia.org/resource/List_of_20th-century_classical_composers>
3740 <http://dbpedia.org/resource/List_of_airports_by_ICAO_code:_K>
3705 <http://dbpedia.org/resource/List_of_United_States_counties_and_county_equivalents>
3659 <http://dbpedia.org/resource/List_of_Russian_people>
3646 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Germany>
3615 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Switzerland>
3597 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Slovakia>
3589 <http://dbpedia.org/resource/List_of_protected_areas_of_China>
3541 <http://dbpedia.org/resource/Index_of_U.S._counties>
3502 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Hungary>
3499 <http://dbpedia.org/resource/The_opera_corpus>
3466 <http://dbpedia.org/resource/List_of_German_Christian_Democratic_Union_politicians>
3466 <http://dbpedia.org/resource/2012%E2%80%9313_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3439 <http://dbpedia.org/resource/List_of_viruses>
3432 <http://dbpedia.org/resource/2013%E2%80%9314_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3430 <http://dbpedia.org/resource/List_of_Lepidoptera_of_the_Czech_Republic>
3392 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Greece>
3378 <http://dbpedia.org/resource/List_of_surnames_in_Russia>
3378 <http://dbpedia.org/resource/List_of_film_director_and_actor_collaborations>
3377 <http://dbpedia.org/resource/2010%E2%80%9311_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3342 <http://dbpedia.org/resource/2009%E2%80%9310_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3327 <http://dbpedia.org/resource/Index_of_World_War_II_articles_(U)>
3295 <http://dbpedia.org/resource/List_of_country_houses_in_the_United_Kingdom>
3277 <http://dbpedia.org/resource/List_of_counties_by_U.S._state>
3255 <http://dbpedia.org/resource/List_of_moths_of_North_America_(MONA_8322-11233)>
3236 <http://dbpedia.org/resource/Catalog_of_paintings_in_the_National_Gallery,_London>
3233 <http://dbpedia.org/resource/IUCN_Red_List_endangered_species_(Animalia)>
3232 <http://dbpedia.org/resource/IUCN_Red_List_near_threatened_species_(Animalia)>
3209 <http://dbpedia.org/resource/List_of_Chopped_episodes>
3201 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Poland>
3200 <http://dbpedia.org/resource/List_of_directorial_debuts>
3192 <http://dbpedia.org/resource/List_of_postal_codes_in_Germany>
3175 <http://dbpedia.org/resource/2010_in_film>
3163 <http://dbpedia.org/resource/Index_of_philosophy_articles_(R%E2%80%93Z)>
3156 <http://dbpedia.org/resource/List_of_bannered_U.S._Routes>
3135 <http://dbpedia.org/resource/Index_of_Byzantine_Empire-related_articles>
3129 <http://dbpedia.org/resource/Index_of_Singapore-related_articles>
3114 <http://dbpedia.org/resource/List_of_postal_codes_of_Switzerland>
3107 <http://dbpedia.org/resource/2009_in_film>
3088 <http://dbpedia.org/resource/List_of_University_of_Pennsylvania_people>
3071 <http://dbpedia.org/resource/List_of_children's_television_series_by_country>
3065 <http://dbpedia.org/resource/List_of_populated_places_in_the_Netherlands>
3044 <http://dbpedia.org/resource/List_of_ZX_Spectrum_games>
3039 <http://dbpedia.org/resource/October_2011_in_sports>
3022 <http://dbpedia.org/resource/List_of_flora_of_Ohio>
3019 <http://dbpedia.org/resource/List_of_PlayStation_2_games>
3016 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Bulgaria>
3005 <http://dbpedia.org/resource/List_of_voice_actors>


### Observations:

The top subjects are clearly dominated by list-like resources. Very big “normal” articles such as those of countries like dbpedia:United_States (1375 occurrences as subject) or dbpedia:Germany (1331 occurrences as subject) only show up around ranks 1518 and 1673. Scrolling through the top subject counts, the ratio of “List” to non-“List” resources seems to slowly even out around 1000 occurrences (rank 3800+), but even among subjects that “only” occur ~500 times (rank 21000+) about 1/4 still seem to be “Lists”.

## Top 100 Predicates:

149707899 <http://dbpedia.org/ontology/wikiPageWikiLink>
86391520 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
33958849 <http://www.w3.org/2002/07/owl#sameAs>
18731754 <http://purl.org/dc/terms/subject>
13926391 <http://www.w3.org/2000/01/rdf-schema#label>
13494896 <http://dbpedia.org/ontology/wikiPageRevisionID>
13494875 <http://www.w3.org/ns/prov#wasDerivedFrom>
13494819 <http://dbpedia.org/ontology/wikiPageID>
10948106 <http://dbpedia.org/ontology/wikiPageOutDegree>
10948106 <http://dbpedia.org/ontology/wikiPageLength>
10948086 <http://xmlns.com/foaf/0.1/primaryTopic>
10948086 <http://xmlns.com/foaf/0.1/isPrimaryTopicOf>
10948086 <http://purl.org/dc/elements/1.1/language>
6473988 <http://dbpedia.org/ontology/wikiPageRedirects>
5926272 <http://dbpedia.org/ontology/abstract>
5925778 <http://www.w3.org/2000/01/rdf-schema#comment>
4267352 <http://xmlns.com/foaf/0.1/name>
4041585 <http://dbpedia.org/property/hasPhotoCollection>
3781737 <http://dbpedia.org/property/name>
2342002 <http://purl.org/dc/elements/1.1/rights>
2084717 <http://purl.org/dc/elements/1.1/description>
1514496 <http://dbpedia.org/ontology/team>
1374565 <http://xmlns.com/foaf/0.1/depiction>
1374185 <http://dbpedia.org/ontology/thumbnail>
1363398 <http://dbpedia.org/ontology/wikiPageDisambiguates>
1289141 <http://dbpedia.org/property/title>
1231780 <http://dbpedia.org/property/subdivisionType>
1171004 <http://xmlns.com/foaf/0.1/thumbnail>
1122598 <http://www.w3.org/2004/02/skos/core#prefLabel>
1080114 <http://xmlns.com/foaf/0.1/givenName>
1052578 <http://dbpedia.org/property/shortDescription>
1052115 <http://xmlns.com/foaf/0.1/surname>
1005079 <http://dbpedia.org/ontology/birthPlace>
995639 <http://dbpedia.org/ontology/birthDate>
983813 <http://dbpedia.org/property/subdivisionName>
973597 <http://dbpedia.org/ontology/birthYear>
968085 <http://dbpedia.org/property/dateOfBirth>
907869 <http://www.w3.org/2003/01/geo/wgs84_pos#lat>
906919 <http://www.w3.org/2003/01/geo/wgs84_pos#long>
861765 <http://dbpedia.org/property/goals>
846283 <http://dbpedia.org/property/placeOfBirth>
846182 <http://dbpedia.org/ontology/isPartOf>
838381 <http://dbpedia.org/property/birthPlace>
826348 <http://dbpedia.org/property/years>
656559 <http://dbpedia.org/property/length>
653929 <http://dbpedia.org/property/date>
649375 <http://xmlns.com/foaf/0.1/homepage>
643162 <http://dbpedia.org/ontology/careerStation>
641528 <http://dbpedia.org/ontology/years>
574296 <http://dbpedia.org/property/birthDate>
556627 <http://dbpedia.org/property/genre>
553122 <http://dbpedia.org/ontology/country>
539366 <http://dbpedia.org/property/clubs>
529649 <http://dbpedia.org/property/location>
525787 <http://dbpedia.org/property/rd1Team>
512507 <http://dbpedia.org/ontology/numberOfGoals>
501875 <http://dbpedia.org/ontology/genre>
492028 <http://dbpedia.org/ontology/numberOfMatches>
453911 <http://dbpedia.org/ontology/deathDate>
449759 <http://dbpedia.org/ontology/deathYear>
448696 <http://dbpedia.org/property/dateOfDeath>
448362 <http://www.w3.org/2002/07/owl#equivalentClass>
446799 <http://dbpedia.org/property/caption>
446238 <http://www.w3.org/2000/01/rdf-schema#subClassOf>
437797 <http://dbpedia.org/property/wordnet_type>
435648 <http://dbpedia.org/property/type>
431262 <http://dbpedia.org/property/caps>
418391 <http://dbpedia.org/ontology/utcOffset>
362378 <http://dbpedia.org/property/percentage>
362327 <http://dbpedia.org/ontology/type>
355814 <http://dbpedia.org/property/country>
346584 <http://dbpedia.org/property/candidate>
340788 <http://dbpedia.org/property/starring>
338307 <http://dbpedia.org/ontology/location>
327879 <http://dbpedia.org/ontology/family>
326730 <http://dbpedia.org/property/longew>
326699 <http://dbpedia.org/property/latns>
315041 <http://dbpedia.org/property/writer>
314566 <http://dbpedia.org/ontology/starring>
312958 <http://dbpedia.org/property/label>
310020 <http://dbpedia.org/property/rd2Team>
306816 <http://dbpedia.org/property/settlementType>
306271 <http://dbpedia.org/property/longd>
306246 <http://dbpedia.org/property/latd>
306195 <http://dbpedia.org/ontology/populationTotal>
293237 <http://dbpedia.org/property/team>
283282 <http://dbpedia.org/property/producer>
279814 <http://dbpedia.org/ontology/occupation>
278108 <http://dbpedia.org/ontology/order>
277006 <http://dbpedia.org/property/episodenumber>
275430 <http://dbpedia.org/property/longm>
275416 <http://dbpedia.org/property/latm>
272562 <http://dbpedia.org/ontology/deathPlace>
267256 <http://dbpedia.org/ontology/class>
265454 <http://dbpedia.org/property/timezone>
264081 <http://dbpedia.org/ontology/viafId>


### Observations:

The predicates are clearly dominated by dbpedia-owl:wikiPageWikiLink and rdf:type relations.
What’s a bit surprising to me is that dcterms:subject occurs less often than rdf:type, but my guess is that this is due to the YAGO types and hierarchy materialization (an Athlete is also a Person). There’s a slight mismatch between dbpedia-owl:wikiPageRevisionID and prov:wasDerivedFrom. There are more dbpedia-owl:abstract than rdfs:comment triples and more geo:lat than geo:long triples.

## Top 100 Objects:

10948086 <http://xmlns.com/foaf/0.1/Document> .
10948086 "en"^^<http://www.w3.org/2001/XMLSchema#string> .
6239553 "1"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
2250659 <http://dbpedia.org/class/yago/PhysicalEntity100001930> .
2169386 <http://dbpedia.org/class/yago/Object100002684> .
2155200 <http://www.w3.org/2002/07/owl#Thing> .
1974654 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Agent> .
1974654 <http://dbpedia.org/ontology/Agent> .
1816213 <http://dbpedia.org/class/yago/YagoLegalActorGeo> .
1650316 <http://xmlns.com/foaf/0.1/Person> .
1649647 <http://wikidata.dbpedia.org/resource/Q5> .
1649647 <http://wikidata.dbpedia.org/resource/Q215627> .
1649647 <http://schema.org/Person> .
1649646 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#NaturalPerson> .
1649646 <http://dbpedia.org/ontology/Person> .
1621660 <http://dbpedia.org/class/yago/Whole100003553> .
1318799 <http://dbpedia.org/resource/Category:Living_people> .
1290718 <http://dbpedia.org/class/yago/YagoLegalActor> .
1257968 <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> .
1192248 <http://www.w3.org/2004/02/skos/core#Concept> .
1090313 <http://dbpedia.org/class/yago/LivingThing100004258> .
1090140 <http://dbpedia.org/class/yago/Organism100004475> .
1046726 <http://dbpedia.org/class/yago/Person100007846> .
1020287 <http://dbpedia.org/class/yago/CausalAgent100007347> .
868376 <http://dbpedia.org/resource/United_States> .
816854 <http://www.ontologydesignpatterns.org/ont/d0.owl#Location> .
816837 <http://schema.org/Place> .
816837 <http://dbpedia.org/ontology/Wikidata:Q532> .
816837 <http://dbpedia.org/ontology/Place> .
814269 <http://dbpedia.org/class/yago/YagoGeoEntity> .
726965 <http://dbpedia.org/class/yago/Abstraction100002137> .
658562 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Situation> .
643162 <http://dbpedia.org/ontology/CareerStation> .
561841 <http://dbpedia.org/class/yago/LivingPeople> .
547827 <http://dbpedia.org/ontology/PopulatedPlace> .
547037 "0"^^<http://www.w3.org/2001/XMLSchema#integer> .
539993 "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
531929 <http://www.opengis.net/gml/_Feature> .
528794 <http://dbpedia.org/class/yago/Artifact100021939> .
526256 <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing> .
524742 <http://dbpedia.org/class/yago/Location100027167> .
505425 <http://dbpedia.org/class/yago/Region108630985> .
476724 "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
469006 <http://dbpedia.org/ontology/Settlement> .
438713 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#InformationEntity> .
425044 <http://schema.org/CreativeWork> .
425044 <http://dbpedia.org/ontology/Work> .
419234 <http://dbpedia.org/resource/List_of_sovereign_states> .
401287 "N"@en .
400317 <http://dbpedia.org/resource/Animal> .
377252 "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
358891 <http://dbpedia.org/class/yago/GeographicalArea108574314> .
350209 <http://dbpedia.org/class/yago/District108552138> .
347718 <http://dbpedia.org/class/yago/Group100031264> .
336091 <http://dbpedia.org/ontology/Athlete> .
335320 "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
313062 <http://dbpedia.org/class/yago/SocialGroup107950920> .
302658 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#SocialPerson> .
302658 <http://schema.org/Organization> .
302658 <http://dbpedia.org/ontology/Organisation> .
292395 "28"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
288074 "29"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
287957 "27"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
286256 "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
283702 <http://dbpedia.org/class/yago/Organization108008335> .
279633 "yes"@en .
279134 <http://dbpedia.org/ontology/SportsTeamMember> .
279134 <http://dbpedia.org/ontology/OrganisationMember> .
277439 "30"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
277025 "26"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
276000 "6"^^<http://www.w3.org/2001/XMLSchema#integer> .
264578 "31"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
263773 <http://dbpedia.org/class/yago/Contestant109613191> .
261435 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Organism> .
261435 <http://dbpedia.org/ontology/Species> .
260007 "E"@en .
256474 <http://dbpedia.org/ontology/Eukaryote> .
255993 "25"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
252172 <http://dbpedia.org/resource/England> .
249675 "32"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
240706 <http://dbpedia.org/resource/Iran_Standard_Time> .
234043 "24"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
231236 "33"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
228539 "7"^^<http://www.w3.org/2001/XMLSchema#integer> .
221419 "0".
219180 <http://dbpedia.org/class/yago/Player110439851> .
218573 "23"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
218297 <http://dbpedia.org/class/yago/PsychologicalFeature100023100> .
217297 "34"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
215320 <http://dbpedia.org/class/yago/Athlete109820263> .
214359 "18"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
208654 <http://dbpedia.org/class/yago/Tract108673395> .
208383 <http://dbpedia.org/resource/Arthropod> .
206076 "22"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
205888 "8"^^<http://www.w3.org/2001/XMLSchema#integer> .
204693 <http://dbpedia.org/resource/Lepidoptera> .
204220 "21"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
203472 <http://dbpedia.org/class/yago/Instrumentality103575240> .
202270 "35"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .


### Observations:

The top of the object counts is dominated by foaf:Document and “en”, which lead the bulk of the list by roughly an order of magnitude. The non-negative “1” follows, an order of magnitude ahead of the normal “0” and “1” 😉 In between a lot of very useful types follow, and we can see that we have a lot of information about physical things, people, concepts and places. It’s also nice to see http://wikidata.dbpedia.org/resource/Q5 right under foaf:Person, even though the URI doesn’t resolve anymore(?) 🙁

The first “real” “A-Box” resource is dbpedia:United_States, followed by dbpedia:Animal, dbpedia:England, dbpedia:Iran_Standard_Time, dbpedia:Arthropod, dbpedia:Lepidoptera, dbpedia:Canada, dbpedia:Insect, dbpedia:France, dbpedia:United_Kingdom, dbpedia:India, dbpedia:Germany. In general it seems as if apart from ontology types many instances of types country, biological genus and city occur very often as objects.
The top literals seem to be numbers, especially years and single letters.

## Conclusion

We’ve seen that it’s sadly not possible to get basic top-degree-counts for big datasets via SPARQL, as the endpoints don’t seem to be optimized for this kind of query. I hope this changes in the future, as it’s quite useful to know degree distributions for all kinds of queries. Especially in the machine learning sector it seems quite essential to know if you’re dealing with a “normal” node or one of the exceptional top nodes that is several orders of magnitude bigger than the rest.
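
For reference, subject and predicate counts like the ones listed in this post can be computed straight from the N-Triples dump files with standard command line tools, without touching a SPARQL endpoint. The following is only a minimal sketch and not necessarily how the lists here were produced; the function name top_terms and the file path in the usage comment are made up for illustration.

```shell
# Hypothetical helper: print the n most frequent terms in one column of a
# gzipped N-Triples file (field 1 = subject, field 2 = predicate).
top_terms() (
    file=$1 field=$2 n=${3:-10}
    gzip -dc "$file" |
        # skip comment lines (the DBpedia dumps contain "# started ..." lines)
        awk -v f="$field" '!/^#/ { print $f }' |
        LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr |
        head -n "$n" |
        # strip the left padding that uniq -c adds
        awk '{ print $1, $2 }'
)

# usage (path is an assumption):
# top_terms /usr/local/data/datasets/dbpedia/2014/en/labels_en.nt.gz 2 100
```

This only works for subjects and predicates, which never contain spaces in N-Triples. Objects can be literals with spaces in them, so they need a proper split of each line instead of a plain column cut.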

Hope you enjoyed. Feedback welcome, as always.

Thanks for all the feedback I got on this post. There are somewhat similar works that you might be interested in:

# Setting up a local DBpedia 2014 mirror with Virtuoso 7.1.0

So you’re the guy who is allowed to setup a local DBpedia mirror or more generally a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trials and errors. If anything goes wrong (or everything works fine), feel free to leave a comment below.

## Versions of this guide

There are three older versions of this guide:

• Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1
• May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)
• Apr. 2014: Update to DBpedia 3.9 and Virtuoso 7

In this step by step guide I’ll tell you how to install a local Linked Data mirror of the DBpedia 2014, hosting a combination of the regular English and (exemplary) the i18n German datasets adding up to over half a billion triples. If this isn’t enough you can also follow the links to the Freebase, DBLP, Yago, Umbel and Schema.org datasets / vocabularies adding up to over 3.5 billion triples.

Let’s jump in.

## Used Versions

• DBpedia 2014
• Virtuoso OpenSource 7.1.0
• Ubuntu 14.04 LTS

## Prerequisites

A strong machine with root access and enough RAM: We used a VM with 4 Cores and 32 GBs of RAM for DBpedia only. If you intend to also load Freebase and other datasets I recommend at least 64 GBs of RAM (we actually ended up using a 16 Core, 256 GB RAM Server). For installing I recommend more than 128 GB free HD space for DBpedia alone, 256 GB if you want to load Freebase as well, especially for downloading and repacking the datasets, as well as the growing database file when importing (mine grew to 50 GBs for DBpedia and 180 GB with Freebase).

## Let’s go

Go and download Virtuoso OpenSource from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.1.0 as in this guide or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:

cd ~
tar -xvzf virtuoso-7.1.0.tar.gz
cd virtuoso-opensource-7.1.0 # or newer, depending what you got


Now do the following to install the prerequisites and then build virtuoso:

sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev \
libmagickcore-dev libmagickwand-dev dnsutils gawk bison flex gperf

# NOTICE: the following will _not_ install into /usr/local but into /usr
# (so might clash with packages by your distribution if you install
# "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
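export CFLAGS="-O2 -m64"
# configure with the debian layout described in the notice above and build
# the DBpedia VAD that is installed at the end (these flags are an
# assumption, check ./configure --help for your version)
./configure --with-layout=debian --enable-dbpedia-vad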

# the following will build with 5 processes in parallel
# choose something like your server's #CPUs + 1
make -j5


This will take about 5 min

sudo make install


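Now change the following values in /var/lib/virtuoso/db/virtuoso.ini. The performance tuning settings follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets (keep the entries that are
# already listed in your file and append the new one):
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this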
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
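
The buffer arithmetic from the comments above can be scripted if you want to adapt the values to a different amount of RAM. This is just a sketch of the 65 % rule quoted there: the function name suggest_buffers is made up, and the 3/4 ratio for MaxDirtyBuffers is my assumption based on the ratio of the two values used above, so cross-check the result against the comments in your own virtuoso.ini.

```shell
# Hypothetical helper: derive buffer settings from the RAM size in GB.
suggest_buffers() (
    ram_gb=$1
    # 65 % of RAM, divided by approx. 8700 bytes per buffer
    buffers=$(( ram_gb * 1000 * 1000 * 1000 * 65 / 100 / 8700 ))
    echo "NumberOfBuffers = $buffers"
    # assumption: keep MaxDirtyBuffers at about 3/4 of NumberOfBuffers
    echo "MaxDirtyBuffers = $(( buffers * 3 / 4 ))"
    # 1/4th of NumberOfBuffers, as noted in the [Database] section
    echo "MaxCheckpointRemap = $(( buffers / 4 ))"
)

suggest_buffers 32   # first line: NumberOfBuffers = 2390804
```

For 32 GB this reproduces the 2390804 from the calculation in the comment. The values actually used above are the slightly larger ones suggested in the original .ini file.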


The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper


You should now have a running virtuoso server.

### DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 2014 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic as for example articles which are only in the German Wikipedia won’t be extracted. To solve this problem the i18n versions exist and create IRIs in the form of de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

This approach has several implications. For backwards compatibility reasons the standard DBpedia makes statements about URIs such as http://dbpedia.org/resource/Gerhard_Schr%C3%B6der while the local chapters, like the German one, make statements about IRIs such as http://de.dbpedia.org/resource/Gerhard_Schröder (note the ö). In other words and as written above: the standard DBpedia uses URIs to identify things, while the localized versions use IRIs. This also means that http://dbpedia.org/resource/Gerhard_Schröder shouldn’t work. That said, clicking the link will actually work as there is magic going on in your browser to give you what you probably meant. Using curl (curl -i -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Gerhard_Schröder) or SPARQLing the endpoint will nevertheless not be so nice/sloppy and can cause quite some headache: compare select * where { dbpedia:Gerhard_Schröder ?p ?o. } vs. select * where { <http://dbpedia.org/resource/Gerhard_Schr%C3%B6der> ?p ?o. }. In order to mitigate this historic problem a bit, DBpedia actually offers owl:sameAs links from IRIs to URIs (en/iri_same_as_uri_en), which you should load so that you at least have a link to what you want if someone tries to get info about an IRI.
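
To make the difference concrete, here is a small sketch that turns an IRI into the corresponding URI by percent-encoding every non-ASCII byte of its UTF-8 representation. The function name iri_to_uri is made up, the input is assumed to be UTF-8, and this is only the generic IRI to URI mapping, not necessarily the exact escaping rules DBpedia applied when minting its URIs.

```shell
# Hypothetical helper: percent-encode all non-ASCII bytes of a UTF-8 string.
iri_to_uri() (
    # od prints every byte as a decimal number; values >= 128 are non-ASCII
    for byte in $(printf '%s' "$1" | od -An -v -tu1); do
        if [ "$byte" -lt 128 ]; then
            # plain ASCII byte: emit it unchanged (via its octal escape)
            printf '%b' "\\0$(printf '%03o' "$byte")"
        else
            # part of a multi-byte UTF-8 sequence: emit %XX
            printf '%%%02X' "$byte"
        fi
    done
    printf '\n'
)

iri_to_uri 'http://dbpedia.org/resource/Gerhard_Schröder'
# prints http://dbpedia.org/resource/Gerhard_Schr%C3%B6der
```

The ö is stored as the two bytes C3 B6 in UTF-8, which is exactly what shows up in the URI form.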

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format as follows. If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the import was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files). File access becomes a bottleneck if you have a couple of cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see I do the repacking per folder in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repackaging process below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/2014
cd /usr/local/data/datasets/dbpedia/2014

# if you want to save space do this:
for d in */ ; do for i in "${d%/}"/*.bz2 ; do bzcat "$i" | gzip > "${i%.bz2}.gz" && rm "$i" ; done & done
# else do:
#bunzip2 */*.bz2 &

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
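
If you have more cores than language folders, the repacking can also be parallelized per file instead of per folder. The following is a sketch under a few assumptions: GNU find and xargs (as on Ubuntu / Debian) are available, the function name repack_dir is made up, and the extra integrity checks are my addition, trading some CPU time for not deleting a dump that didn’t convert cleanly.

```shell
# Hypothetical helper: repack all *.bz2 files below a directory to *.gz,
# running up to $2 conversions at the same time (default: 4).
repack_dir() (
    dir=$1 jobs=${2:-4}
    find "$dir" -type f -name '*.bz2' -print0 |
        xargs -0 -r -n 1 -P "$jobs" sh -c '
            src=$1
            dst=${src%.bz2}.gz
            # check the source first, repack, check the result and only
            # then remove the original
            bzip2 -t "$src" &&
                bzcat "$src" | gzip > "$dst" &&
                gzip -t "$dst" &&
                rm "$src"
        ' sh
)

# usage (path as used in this guide, 8 parallel jobs):
# repack_dir /usr/local/data/datasets/dbpedia/2014 8
```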


### Data Cleaning and The bulk loader scripts

In contrast to the previous versions of this article, the virtuoso import will take care of shortening too long IRIs itself. Also, it seems the bulk loader script is included in the more recent Virtuoso versions, so as a reference only: see the old version for the cleaning script and VirtBulkRDFLoaderExampleDbpedia.

### Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_2014.owl file (first step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir('dir', '*.*', 'graph');.
If you manually want to add some files, use ld_add('file', 'graph');.
See the VirtBulkRDFLoaderScript file for details.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track (and easily reproduce) what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/2014/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/2014/.

Also be aware that if you load certain parts of the dumps into different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets), only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

So if you want to load the same datasets as loaded on the official endpoint (but restricted to the EN and DE ones), the following should do the trick to link them up for the next steps:

cd /usr/local/data/datasets/dbpedia/2014/
mkdir importedGraphs
cd importedGraphs

mkdir dbpedia.org
cd dbpedia.org
# ln -s ../../dbpedia_2014.owl ./ # see below!

ln -s ../../en/article_categories_en.nt.gz ./
ln -s ../../en/category_labels_en.nt.gz ./
ln -s ../../en/disambiguations_en.nt.gz ./
ln -s ../../en/geo_coordinates_en.nt.gz ./
ln -s ../../en/homepages_en.nt.gz ./
ln -s ../../en/images_en.nt.gz ./
ln -s ../../en/infobox_properties_en.nt.gz ./
ln -s ../../en/infobox_property_definitions_en.nt.gz ./
ln -s ../../en/instance_types_en.nt.gz ./
ln -s ../../en/instance_types_heuristic_en.nt.gz ./
ln -s ../../en/iri_same_as_uri_en.nt.gz ./
ln -s ../../en/labels_en.nt.gz ./
ln -s ../../en/long_abstracts_en.nt.gz ./
ln -s ../../en/mappingbased_properties_cleaned_en.nt.gz ./
ln -s ../../en/page_ids_en.nt.gz ./
ln -s ../../en/persondata_en.nt.gz ./
ln -s ../../en/redirects_transitive_en.nt.gz ./
ln -s ../../en/revision_ids_en.nt.gz ./
ln -s ../../en/revision_uris_en.nt.gz ./
ln -s ../../en/short_abstracts_en.nt.gz ./
ln -s ../../en/skos_categories_en.nt.gz ./
ln -s ../../en/specific_mappingbased_properties_en.nt.gz ./

ln -s ../../de/labels_en_uris_de.nt.gz ./
ln -s ../../de/long_abstracts_en_uris_de.nt.gz ./
ln -s ../../de/short_abstracts_en_uris_de.nt.gz ./

ln -s ../../fr/labels_en_uris_fr.nt.gz ./
ln -s ../../fr/long_abstracts_en_uris_fr.nt.gz ./
ln -s ../../fr/short_abstracts_en_uris_fr.nt.gz ./
cd ..

mkdir ext.dbpedia.org
cd ext.dbpedia.org
ln -s ../../en/genders_en.nt.gz ./
ln -s ../../en/out_degree_en.nt.gz ./
ln -s ../../en/page_length_en.nt.gz ./
cd ..

mkdir pagelinks.dbpedia.org
cd pagelinks.dbpedia.org
ln -s ../../en/page_links_en.nt.gz ./
cd ..

mkdir topicalconcepts.dbpedia.org
cd topicalconcepts.dbpedia.org
ln -s ../../en/topical_concepts_en.nt.gz ./
cd ..

mkdir de.dbpedia.org
cd de.dbpedia.org
ln -s ../../de/article_categories_de.nt.gz ./
ln -s ../../de/category_labels_de.nt.gz ./
ln -s ../../de/disambiguations_de.nt.gz ./
ln -s ../../de/geo_coordinates_de.nt.gz ./
ln -s ../../de/homepages_de.nt.gz ./
ln -s ../../de/images_de.nt.gz ./
ln -s ../../de/infobox_properties_de.nt.gz ./
ln -s ../../de/infobox_property_definitions_de.nt.gz ./
ln -s ../../de/instance_types_de.nt.gz ./
ln -s ../../de/iri_same_as_uri_de.nt.gz ./
ln -s ../../de/labels_de.nt.gz ./
ln -s ../../de/long_abstracts_de.nt.gz ./
ln -s ../../de/mappingbased_properties_de.nt.gz ./
ln -s ../../de/out_degree_de.nt.gz ./
ln -s ../../de/page_ids_de.nt.gz ./
ln -s ../../de/page_length_de.nt.gz ./
ln -s ../../de/persondata_de.nt.gz ./
ln -s ../../de/pnd_de.nt.gz ./
ln -s ../../de/redirects_transitive_de.nt.gz ./
ln -s ../../de/revision_ids_de.nt.gz ./
ln -s ../../de/revision_uris_de.nt.gz ./
ln -s ../../de/short_abstracts_de.nt.gz ./
ln -s ../../de/skos_categories_de.nt.gz ./
ln -s ../../de/specific_mappingbased_properties_de.nt.gz ./
cd ..

mkdir pagelinks.de.dbpedia.org
cd pagelinks.de.dbpedia.org
ln -s ../../de/page_links_de.nt.gz ./
cd ..
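
If you’d rather not maintain such a long list of ln -s lines by hand, the same structure can be produced by a small helper. This is only a sketch: the function name link_graph is made up, and it assumes that you call it from the dataset root (the directory containing en/, de/ and importedGraphs/) and that the files have already been repacked to .nt.gz.

```shell
# Hypothetical helper: link the given dataset files of one language folder
# into importedGraphs/<graph>/ (run it from the dataset root directory).
link_graph() (
    graph=$1 lang=$2
    shift 2
    mkdir -p "importedGraphs/$graph"
    for name in "$@"; do
        src="$lang/$name.nt.gz"
        if [ -e "$src" ]; then
            # same relative link target as in the manual ln -s calls above
            ln -sf "../../$src" "importedGraphs/$graph/"
        else
            echo "skipping missing file: $src" >&2
        fi
    done
)

# usage:
# link_graph dbpedia.org en labels_en long_abstracts_en short_abstracts_en
# link_graph dbpedia.org de labels_en_uris_de
# link_graph de.dbpedia.org de labels_de long_abstracts_de
```

The warning about missing files is there so that a typo in a dataset name doesn’t silently result in an incomplete graph.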


This should have prepared your importedGraphs directory. From this directory you can run the following command, which prints out the necessary isql commands to register your graphs for importing:

for g in * ; do echo "ld_dir_all('$(pwd)/$g', '*.*', 'http://$g');" ; done  One more thing (thanks to Romain): In order for the DBpedia.vad package (which is installed at the end) to work correctly, the dbpedia_2014.owl file needs to be imported into graph http://dbpedia.org/resource/classes#. Note: In the following i will assume that your virtuoso isql command is called isql. If you’re in lack of such a command it might be called isql-vt, but this usually means you installed it using some other method than described in here isql # enter virtuoso sql mode  -- we are in sql mode now ld_add('/usr/local/data/datasets/remote/dbpedia/2014/dbpedia_2014.owl', 'http://dbpedia.org/resource/classes#'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org', '*.*', 'http://ext.dbpedia.org'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.de.dbpedia.org', '*.*', 'http://pagelinks.de.dbpedia.org'); ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/topicalconcepts.dbpedia.org', '*.*', 'http://topicalconcepts.dbpedia.org'); -- do the following to see which files were registered to be added: select * from DB.DBA.LOAD_LIST; -- if unsatisfied use: -- delete from DB.DBA.LOAD_LIST; EXIT;  You can now also register other datasets like Freebase, DBLP, Yago, Umbel and Schema.org … that you want to be loaded. 
Our full DB.DBA.LOAD_LIST currently looks like this:

select ll_graph, ll_file from DB.DBA.LOAD_LIST;

ll_graph                              ll_file
VARCHAR                               VARCHAR NOT NULL
_______________________________________________________________________________

http://dblp.l3s.de                    /usr/local/data/datasets/remote/dblp/l3s/2014-11-08/dblp.nt.gz
http://dbpedia.org/resource/classes#  /usr/local/data/datasets/remote/dbpedia/2014/dbpedia_2014.owl
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/amsterdammuseum_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/article_categories_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bbcwildlife_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bookmashup_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bricklink_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/category_labels_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/cordis_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dailymed_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dblp_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dbtune_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/disambiguations_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/diseasome_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/drugbank_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eunis_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eurostat_linkedstatistics_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eurostat_wbsg_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/external_links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/factbook_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/flickrwrappr_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/freebase_links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gadm_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geo_coordinates_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geonames_links_en_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geospecies_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gho_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gutenberg_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/homepages_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/images_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/infobox_properties_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/infobox_property_definitions_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/instance_types_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/instance_types_heuristic_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/interlanguage_links_chapters_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/iri_same_as_uri_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/italian_public_schools_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en_uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en_uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/linkedgeodata_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/linkedmdb_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en_uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en_uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/mappingbased_properties_cleaned_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/musicbrainz_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/nytimes_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/opencyc_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/openei_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/page_ids_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/persondata_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/redirects_transitive_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revision_ids_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revision_uris_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revyu_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en_uris_de.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en_uris_fr.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/sider_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/skos_categories_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/specific_mappingbased_properties_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/tcm_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/umbel_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/uscensus_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wikicompany_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wikipedia_links_en.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wordnet_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_taxonomy.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_type_links.nt.gz
http://dbpedia.org                    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_types.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/article_categories_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/category_labels_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/disambiguations_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/external_links_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/freebase_links_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/geo_coordinates_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/homepages_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/images_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/infobox_properties_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/infobox_property_definitions_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/instance_types_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/interlanguage_links_chapters_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/iri_same_as_uri_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/labels_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/long_abstracts_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/mappingbased_properties_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/out_degree_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/page_ids_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/page_length_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/persondata_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/pnd_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/redirects_transitive_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/revision_ids_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/revision_uris_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/short_abstracts_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/skos_categories_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/specific_mappingbased_properties_de.nt.gz
http://de.dbpedia.org                 /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/wikipedia_links_de.nt.gz
http://ext.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/genders_en.nt.gz
http://ext.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/out_degree_en.nt.gz
http://ext.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/page_length_en.nt.gz
http://pagelinks.dbpedia.org          /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.dbpedia.org/page_links_en.nt.gz
http://pagelinks.de.dbpedia.org       /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.de.dbpedia.org/page_links_de.nt.gz
http://topicalconcepts.dbpedia.org    /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/topicalconcepts.dbpedia.org/topical_concepts_en.nt.gz
http://rdf.freebase.com               /usr/local/data/datasets/remote/freebase/2014-11-02/freebase-rdf-2014-11-02-00-00.gz
http://schema.org                     /usr/local/data/datasets/remote/schema.org/2014-11-08/all.nt
http://umbel.org/umbel/rc/            /usr/local/data/datasets/remote/umbel/External Ontologies/dbpediaOntology.n3
http://umbel.org/umbel/rc/            /usr/local/data/datasets/remote/umbel/External Ontologies/schema.org.n3
http://umbel.org/umbel                /usr/local/data/datasets/remote/umbel/Ontology/umbel.n3
http://umbel.org/umbel/rc/            /usr/local/data/datasets/remote/umbel/Reference Structure/umbel_reference_concepts.n3
http://yago-knowledge.org/resource/   /usr/local/data/datasets/remote/yago/yago2/2012-12/yagoLabels.ttl.gz

114 Rows. -- 5 msec.

OK, now comes the fun (and long) part: about 1.5 hours for DBpedia alone (the new Virtuoso 7 is cool 😉), plus about 3 hours for Freebase… Now that we have registered the files to be added, let’s finally start the process. Fire up screen if you didn’t already. (For more detailed metering than below see VirtTipsAndTricksGuideLDMeterUtility.)

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- depending on the amount of CPUs and your IO performance you can run
-- more rdf_loader_run(); commands in other isql sessions which will
-- speed up the import process.
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit work;
checkpoint;
EXIT;

After this: Take a look into the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You might use the dump, but it’s incomplete then. Any error aborts the loading of the corresponding file and the loader continues with the next one, so you’re only using the part of that file up to the place where the error occurred. (Should you find errors you can’t fix, please leave a comment.)

### Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890

login: dba
pw: dba

Go to System Admin / Packages. Install the dbpedia (v. 1.4.28) and rdf_mappers (v. 1.34.74) packages (takes about 5 minutes).

### Testing your local mirror

Go to the sparql-endpoint of your server http://your-server:8890/sparql (or in isql prefix with: SPARQL)

sparql SELECT count(*) WHERE { ?s ?p ?o } ;

This shouldn’t take long in Virtuoso 7 anymore and for me now returns 695,553,624 for DBpedia (en+de), and 3,543,872,243 with DBpedia (en+de), Freebase, DBLP, Yago, Umbel and Schema.org.

I also like this query showing all the graphs and how many triples are in them:

sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;

g                                                           callret-1
LONG VARCHAR                                                LONG VARCHAR
____________________________________________________________________________

http://rdf.freebase.com                                     2760013365
http://dbpedia.org                                          375176108
http://pagelinks.dbpedia.org                                149707899
http://de.dbpedia.org                                       92508750
http://dblp.l3s.de                                          72519345
http://pagelinks.de.dbpedia.org                             55804533
http://ext.dbpedia.org                                      21900162
http://yago-knowledge.org/resource                          15372307
http://umbel.org/umbel/rc                                   403452
http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/  256065
http://topicalconcepts.dbpedia.org                          149638
http://dbpedia.org/resource/classes                         27063
http://schema.org                                           8727
http://localhost:8890/DAV/                                  6187
http://www.openlinksw.com/schemas/virtrdf#                  2639
http://umbel.org/umbel                                      1702
http://open.vocab.org/terms                                 1480
http://purl.org/ontology/bibo/                              1226
http://purl.org/goodrelations/v1                            937
http://purl.org/dc/terms/                                   857
http://www.openlinksw.com/schemas/opengraph                 804
http://www.openlinksw.com/schemas/linkedin                  741
http://www.openlinksw.com/schemas/googleplus                696
http://www.openlinksw.com/schemas/google-base               691
http://www.openlinksw.com/schemas/cv                        661
virtrdf-label                                               638
http://xmlns.com/foaf/0.1/                                  557
http://rdfs.org/sioc/ns#                                    553
http://www.openlinksw.com/schemas/evri                      482
http://www.openlinksw.com/schemas/crunchbase                444
http://bblfish.net/work/atom-owl/2006-06-06/                386
http://scot-project.org/scot/ns#                            332
http://www.openlinksw.com/schemas/zillow                    311
http://www.w3.org/2004/02/skos/core                         252
http://www.openlinksw.com/schemas/cnet                      225
http://www.openlinksw.com/schemas/tesco                     183
http://www.openlinksw.com/schemas/bestbuy                   172
http://www.w3.org/2002/07/owl#                              160
http://www.w3.org/2002/07/owl                               160
http://www.openlinksw.com/schemas/angel#                    144
http://www.openlinksw.com/schemas/amazon                    143
http://purl.org/dc/elements/1.1/                            139
http://www.w3.org/2007/05/powder-s#                         117
http://www.openlinksw.com/schemas/twitter                   103
http://www.openlinksw.com/schemas/stackoverflow#            102
http://www.openlinksw.com/schemas/klout                     90
http://www.w3.org/2000/01/rdf-schema#                       87
http://www.w3.org/1999/02/22-rdf-syntax-ns#                 85
http://www.openlinksw.com/schemas/ebay                      79
http://www.openlinksw.com/schema/attribution#               68
http://www.openlinksw.com/schemas/nyt                       41
http://www.openlinksw.com/schemas/wolframalpha#             32
http://www.openlinksw.com/schemas/oplbase                   26
http://www.openlinksw.com/schemas/cert#                     23
http://www.openlinksw.com/schemas/money                     21
http://www.openlinksw.com/schemas/dbpedia-spotlight#        21
http://localhost:8890/sparql                                14
http://dbpedia.org/schema/property_rules#                   12
dbprdf-label                                                6

59 Rows. -- 61717 msec.

Congratulations, you just imported over half a billion triples (or over 3.5 G triples).

### Backing up this initial state

Now is a good moment to backup the whole db (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvf - /var/lib/virtuoso | lzop > virtuoso-7.1.0-DBDUMP-$(date '+%F')-dbpedia-2014-en_de.tar.lzop &&
/etc/init.d/virtuoso-opensource start


Afterwards you might want to repack this with xz (lzma) like this:

# aptitude install xz-utils
for f in virtuoso-7.1.0-DBDUMP-*.tar.lzop ; do lzop -d -c "$f" | xz > "${f%.lzop}.xz" ; done


Yay, done 😉
As always, feel free to leave comments if I made a mistake or to tell us about your problems or how happy you are :D.

### Our database dump file

In case you really want exactly the same state of the public datasets that we have loaded (as described above) you can download our database dump (57 GB, md5sum, including: DBpedia 2014 en,de,links,dbpedia_2014.owl, Freebase, DBLP, Yago, Umbel and Schema.org).

### Thanks

Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an opensource version of their DB.

• 2014-11-24: Thanks to Romain: Load dbpedia_2014.owl into graph http://dbpedia.org/resource/classes# for DBpedia.vad to find it when resolving http://your-server:8890/ontology/author for example.

# Setting up a local DBpedia 3.9 mirror with Virtuoso 7

I just found this aged post in my drafts folder, maybe someone will still like it…

So you’re the guy who is allowed to set up a local DBpedia mirror or more generally a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trial and error. If anything goes wrong, feel free to leave me a comment below.

## Versions of this guide

There are two older versions of this guide:

• Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1
• May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)

With the recent release of Virtuoso 7 (way faster, thanks to OpenLink!) and DBpedia 3.9 I again felt the urge to update this guide, as a couple of things changed.

In this step by step guide I’ll tell you how to install a local Linked Data mirror of the DBpedia 3.9 hosting a combination of the regular English and (exemplary) the i18n German datasets adding up to nearly half a billion triples.

Let’s jump in.

## Used Versions

• DBpedia 3.9 + 3.9-i18n dataset
• Virtuoso OpenSource 7.0.0
• Ubuntu 12.04 LTS

## Prerequisites

A strong machine with root access and enough RAM: We use a VM with 4 Cores and 32 GBs of RAM. For installing I recommend more than 128 GB free HD space, especially for downloading and repacking the datasets, as well as the growing database file when importing (mine grew to 41 GBs).

## Let’s go

Go and download Virtuoso Open Source from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.0.0 as in this guide or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:

cd ~
tar -xvzf virtuoso-7.0.0.tar.gz
cd virtuoso-opensource-7.0.0 # or newer, depending what you got


Now do the following to install the prerequisites and then build virtuoso:

sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev \
libmagickcore-dev libmagickwand-dev dnsutils gawk bison flex gperf

# NOTICE: this will _not_ install into /usr/local but into /usr
# (so might clash with packages by your distribution if you install
# "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
export CFLAGS="-O2 -m64"
./configure --with-layout=debian

# the following will build with 5 processes in parallel
# choose something like your server's #CPUs + 1
make -j5


This will take about 5 min

sudo make install


Now change the following values in /var/lib/virtuoso/db/virtuoso.ini, the performance tuning stuff is according to http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

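# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# add the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets, to DirsAllowed, e.g.:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]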
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
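
The buffer arithmetic from the comments above can be reproduced with a few lines of shell. This is only a sketch under the stated rule of thumb (65 % of RAM, roughly 8700 bytes per buffer, 1 GB counted as 1000^3 bytes); the function names are made up for this example:

```bash
# Sketch of the arithmetic from the comments above (integer division).
# Assumptions: RAM is given in GB (1 GB = 1000^3 bytes), 65 % of it goes to
# Virtuoso, and each buffer costs about 8700 bytes.
number_of_buffers() {
    ram_gb=$1
    echo $(( ram_gb * 1000 * 1000 * 1000 * 65 / 100 / 8700 ))
}

# "1/4th of NumberOfBuffers", as suggested for MaxCheckpointRemap
max_checkpoint_remap() {
    echo $(( $1 / 4 ))
}

buffers=$(number_of_buffers 32)
echo "NumberOfBuffers    = $buffers"
echo "MaxCheckpointRemap = $(max_checkpoint_remap "$buffers")"
```

For 32 GB this prints 2390804 and 597701. The values in the listing above were picked by hand and differ a bit, so treat the computed numbers as a starting point rather than as the one correct setting.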


The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper


You should now have a running virtuoso server.

### DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 3.9 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic as, for example, articles which are only in the German Wikipedia won’t be extracted. To solve this problem the i18n versions exist and create IRIs in the form of de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

This approach has several implications. For backwards compatibility reasons the standard DBpedia makes statements about URIs such as http://dbpedia.org/resource/Gerhard_Schr%C3%B6der while the local chapters, like the German one, make statements about IRIs such as http://de.dbpedia.org/resource/Gerhard_Schröder (note the ö). In other words and as written above: the standard DBpedia uses URIs to identify things, while the localized versions use IRIs. This also means that http://dbpedia.org/resource/Gerhard_Schröder shouldn’t work. That said, clicking the link will actually work as there is magic going on in your browser to give you what you probably meant. Using curl (curl -i -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Gerhard_Schröder) or SPARQLing the endpoint will nevertheless not be as forgiving and can cause quite some headache: compare select * where { dbpedia:Gerhard_Schröder ?p ?o. } with select * where { <http://dbpedia.org/resource/Gerhard_Schr%C3%B6der> ?p ?o. }. In order to mitigate this historic problem a bit, DBpedia actually offers owl:sameAs links from IRIs to URIs: en/iri_same_as_uri_en, which you should load, so you at least have a link to what you want if someone tries to get info about an IRI.
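
To make the difference tangible, here is a small sketch that turns an IRI into the percent-encoded URI form by encoding every byte outside a small safe ASCII set. The function name iri_to_uri and the choice of safe characters are assumptions made for this example, not something DBpedia ships:

```bash
# Sketch: percent-encode every byte outside a small "safe" ASCII set, which
# turns a UTF-8 IRI into the URI form used by the English DBpedia.
# iri_to_uri is a made-up name; od, tr and printf are standard tools.
iri_to_uri() {
    out=
    # od prints the input as one two-digit lowercase hex value per byte
    for byte in $(printf '%s' "$1" | od -An -v -tx1); do
        case $byte in
            2d|2e|2f|3a|5f|7e|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
                # letters, digits and - . / : _ ~ are kept as they are
                oct=$(printf '%03o' "0x$byte")
                out="$out$(printf "\\$oct")" ;;
            *)
                # everything else (including a literal %) becomes %XX
                out="$out%$(printf '%s' "$byte" | tr 'a-f' 'A-F')" ;;
        esac
    done
    printf '%s\n' "$out"
}

# assumes this text is stored and passed as UTF-8
iri_to_uri 'http://dbpedia.org/resource/Gerhard_Schröder'
```

Because a literal % is encoded as well, the function is meant for plain IRIs and would double-encode a string that is already percent-encoded.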

As if this isn’t confusing enough, there is another trap: If you were to download the .ttl files then you suddenly have all statements associated with the IRI for the standard DBpedia (unlike the online endpoint). The only reason I can think of for this inconsistency is that at some point the actual inconsistency of URIs in EN vs. IRIs in everything else will be resolved. For now these files are most certainly not what you want! So use the .nt files!

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German and French DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format (also see previous section with remarks about the TTL files!). If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the importing procedure was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files). File access becomes a bottleneck if you have 4 cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see I do the repacking per folder in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repackaging process below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/3.9
cd /usr/local/data/datasets/dbpedia/3.9

# if you want to save space do this:
for d in */ ; do for i in "${d%/}"/*.bz2 ; do bzcat "$i" | gzip > "${i%.bz2}.gz" && rm "$i" ; done & done
# else do:
#bunzip2 */*.bz2 &

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.


### Data Cleaning and The bulk loader scripts

In contrast to the previous versions of this article the virtuoso import will take care of shortening too long IRIs itself. Also it seems the bulk loader script is included in the more recent Virtuoso versions, so as a reference only: see the old version for the cleaning script and VirtBulkRDFLoaderExampleDbpedia and

### Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_3.9.owl file (last step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir.
If you manually want to add some files, use ld_add.
See the VirtBulkRDFLoaderScript file for args to pass.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/3.9/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/3.9/.

Also be aware of the fact that if you load certain parts of dumps in different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets) that only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

So if you want to load the same datasets as loaded on the official endpoint (but restricted to the EN, DE and FR ones) the following should do the trick to link them up for the next steps:

cd /usr/local/data/datasets/dbpedia/3.9/
mkdir -p importedGraphs/dbpedia.org
cd importedGraphs/dbpedia.org
ln -s \
../../en/article_categories_en.nt.gz \
../../en/category_labels_en.nt.gz \
../../en/disambiguations_en.nt.gz \
../../en/geo_coordinates_en.nt.gz \
../../en/homepages_en.nt.gz \
../../en/images_en.nt.gz \
../../en/instance_types_en.nt.gz \
../../en/instance_types_heuristic_en.nt.gz \
../../en/iri_same_as_uri_en.nt.gz \
../../en/labels_en.nt.gz \
../../en/long_abstracts_en.nt.gz \
../../en/mappingbased_properties_cleaned_en.nt.gz \
../../en/page_ids_en.nt.gz \
../../en/persondata_en.nt.gz \
../../en/pnd_en.nt.gz \
../../en/raw_infobox_properties_en.nt.gz \
../../en/raw_infobox_property_definitions_en.nt.gz \
../../en/redirects_transitive_en.nt.gz \
../../en/revision_ids_en.nt.gz \
../../en/revision_uris_en.nt.gz \
../../en/short_abstracts_en.nt.gz \
../../en/skos_categories_en.nt.gz \
../../en/specific_mappingbased_properties_en.nt.gz \
../../de/labels_en_uris_de.nt.gz \
../../de/long_abstracts_en_uris_de.nt.gz \
../../de/pnd_en_uris_de.nt.gz \
../../de/short_abstracts_en_uris_de.nt.gz \
../../fr/labels_en_uris_fr.nt.gz \
../../fr/long_abstracts_en_uris_fr.nt.gz \
../../fr/short_abstracts_en_uris_fr.nt.gz \
../../dbpedia_3.9.owl \
./
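
A long ln -s call silently creates dangling links if one of the files was not downloaded or repacked. As a hedged alternative, the following sketch (link_into_graph is a made-up helper name) links only files that exist and reports the missing ones. Unlike the relative links above it uses absolute link targets:

```bash
# Sketch: link dataset files into a graph directory and complain about
# missing ones instead of silently creating dangling symlinks.
# link_into_graph is a hypothetical helper, not part of Virtuoso or DBpedia.
link_into_graph() {
    graph_dir=$1
    shift
    mkdir -p "$graph_dir" || return 1
    missing=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            # use an absolute target so the link resolves from inside graph_dir
            abs=$(cd "$(dirname "$f")" && pwd)/$(basename "$f")
            ln -sf "$abs" "$graph_dir/"
        else
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
    done
    # succeed only if every requested file was found
    [ "$missing" -eq 0 ]
}

# Example (run from /usr/local/data/datasets/dbpedia/3.9):
# link_into_graph importedGraphs/dbpedia.org en/labels_en.nt.gz de/labels_en_uris_de.nt.gz
```

Absolute targets keep the links valid when the helper is called from any directory, at the price of breaking if the whole dataset tree is moved later.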


Note: in the following I will assume that your Virtuoso isql command is called isql. If you lack such a command, it might be called isql-vt, but this usually means you installed Virtuoso using some other method than the one described here.

isql # enter virtuoso sql mode

-- we are in sql mode now
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/3.9/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/3.9/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/3.9/importedGraphs/topicalconcepts.dbpedia.org', '*.*', 'http://topicalconcepts.dbpedia.org');

-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- DELETE FROM DB.DBA.LOAD_LIST;
EXIT;
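
With more than a handful of graphs, typing the ld_dir_all lines by hand gets error-prone. A small Python sketch (3.6+) that generates them could look like this. The helper name is made up, and it assumes, as in the three lines above, that each folder below importedGraphs is named after its graph and that the graph IRI is http:// plus that name.

```python
def ld_dir_all_statements(imported_graphs_dir, graph_names, file_mask="*.*"):
    """Return one ld_dir_all(...) statement per graph directory.

    Assumes the graph IRI is "http://" + directory name (an assumption that
    happens to match the graphs used in this guide, not a Virtuoso rule).
    """
    base = imported_graphs_dir.rstrip("/")
    statements = []
    for name in graph_names:
        for value in (base, name, file_mask):
            if "'" in value:
                # Keep the sketch simple: no SQL string escaping, just refuse.
                raise ValueError(f"refusing to quote value containing ': {value!r}")
        statements.append(
            f"ld_dir_all('{base}/{name}', '{file_mask}', 'http://{name}');"
        )
    return statements
```

Printing the result of ld_dir_all_statements("/usr/local/data/datasets/remote/dbpedia/3.9/importedGraphs", ["dbpedia.org", "de.dbpedia.org", "topicalconcepts.dbpedia.org"]) gives you the three statements from above, ready to paste into isql.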


OK, now comes the fun (and long) part: about 1.5 hours (the new Virtuoso 7 is cool 😉)… We registered the files to be added, now let’s finally start the process. Fire up screen if you didn’t already.

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
-- when I visited the virtuoso conductor during the first run...)
-- you can watch the progress from another isql session with:
-- SELECT * FROM DB.DBA.LOAD_LIST;
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit work;
checkpoint;
EXIT;


After this:
Take a look at the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You could carry on regardless, but the imported data is incomplete then: any error aborts the loading of the corresponding file and the loader continues with the next one, so you’re only using the part of that file up to the place where the error occurred. (Should you find errors you can’t fix in the way I did above, please leave a comment.)
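
The log gets long after a full import, so a quick filter helps. The following Python sketch just lists lines that look suspicious. The keywords "error" and "corrupt" are my guess at useful markers, not Virtuoso's documented log format, so treat the result as a starting point and still skim the file yourself.

```python
import re

# Assumed markers, not an official list; adjust to what your log contains.
SUSPICIOUS = re.compile(r"error|corrupt", re.IGNORECASE)


def suspicious_lines(log_path):
    """Return (line_number, text) pairs for log lines matching SUSPICIOUS."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if SUSPICIOUS.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```

For example, looping over suspicious_lines("/var/lib/virtuoso/db/virtuoso.log") and printing each pair shows where to start reading.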

### Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890

login: dba
pw: dba


Go to System Admin / Packages. Install the dbpedia (v. 1.3.83) and rdf_mappers (v. 1.34.72) packages (takes about 5 minutes).

Go to the SPARQL endpoint of your server at http://your-server:8890/sparql (or run the query in isql, prefixed with SPARQL):

sparql SELECT count(*) WHERE { ?s ?p ?o } ;


This shouldn’t take long in Virtuoso 7 anymore and for me now returns 567,173,934.
I also like this query showing all the graphs and how many triples are in them:

sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
g                                                          callret-1
LONG VARCHAR                                               LONG VARCHAR
___________________________________________________________

http://dbpedia.org                                         312505120
http://de.dbpedia.org                                      67997676
http://topicalconcepts.dbpedia.org                         136887
http://localhost:8890/DAV/                                 4709
http://open.vocab.org/terms                                1480
http://purl.org/ontology/bibo/                             1226
http://purl.org/goodrelations/v1                           937
http://purl.org/dc/terms/                                  857
virtrdf-label                                              638
http://xmlns.com/foaf/0.1/                                 557
http://rdfs.org/sioc/ns#                                   553
http://bblfish.net/work/atom-owl/2006-06-06/               386
http://scot-project.org/scot/ns#                           332
http://www.w3.org/2004/02/skos/core                        252
http://www.w3.org/2002/07/owl#                             160
http://www.w3.org/2002/07/owl                              160
http://purl.org/dc/elements/1.1/                           139
http://www.w3.org/2007/05/powder-s#                        117
http://www.w3.org/2000/01/rdf-schema#                      87
http://www.w3.org/1999/02/22-rdf-syntax-ns#                85
http://localhost:8890/sparql                               14
http://dbpedia.org/schema/property_rules#                  12
dbprdf-label                                               6

51 Rows. -- 4563 msec.


Congratulations, you just imported nearly half a billion triples.
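
If you want these per-graph counts in a script (for example to compare them after the next import), here is a hedged Python sketch using only the standard library. It assumes the endpoint accepts the query and format parameters over HTTP GET and answers in the standard SPARQL JSON results format. The function names are made up, and the query uses an explicit alias (?triples) instead of the callret-1 column from above.

```python
import json
import urllib.parse
import urllib.request

GRAPH_COUNT_QUERY = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET URL asking for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


def parse_graph_counts(result_json):
    """Turn a SPARQL JSON result into a list of (graph, triple_count)."""
    bindings = json.loads(result_json)["results"]["bindings"]
    return [(row["g"]["value"], int(row["triples"]["value"])) for row in bindings]


def fetch_graph_counts(endpoint="http://your-server:8890/sparql"):
    """Run the query against the endpoint (needs a reachable server)."""
    url = build_request_url(endpoint, GRAPH_COUNT_QUERY)
    with urllib.request.urlopen(url) as response:
        return parse_graph_counts(response.read().decode("utf-8"))
```

Summing the second column of fetch_graph_counts() should match the total from the plain count(*) query, which is a cheap sanity check.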

### Backing up this initial state

Now is a good moment to back up the whole db (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvjf virtuoso-7.0.0-DBDUMP-dbpedia-3.9-en_de-$(date '+%F').tar.bz2 /var/lib/virtuoso &&
/etc/init.d/virtuoso-opensource start


Yay, done 😉
As always, feel free to leave comments if I made a mistake or to tell us about your problems or how happy you are :D.

### Thanks

Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an opensource version of their DB.

# Scientific Python on Mac OS X 10.9+ with homebrew

There are (too) many guides out there about how to install Python on Mac OS X. So why another one? Because I found that most of the other ways lead to long-term maintenance debt that is unnecessary and can easily be avoided. Also many of the other guides aren’t very easy to extend in “but I also need that library”-cases. So here I’ll briefly explain how to set up a scientific python stack based on homebrew. The advantages are that this stack will never conflict with your system’s core and homebrew opens up a whole new world of easy to access unix tools via a simple brew install ....

This step-by-step installation guide to set up a scientific python environment has been tested on Mac OS X Mavericks 10.9 / Yosemite 10.10 / El Capitan 10.11 / Sierra 10.12. It will probably also work in the following versions (if not, let me know in the comments). An older version of this setup guide can be found here: Scientific Python on Mac OS X 10.8 with homebrew, main changes: rename of a tap + changes wrt. openblas as Accelerate was fixed in OS X 10.9

Needless to say: Make a backup (Timemachine)

First install homebrew.
Follow their instructions, then come back here.

If you don’t have a clean install, some of the following steps might need minor additional attention (like changing permissions with chmod, chown, chgrp or overwriting existing files in the linking step with brew link --overwrite package_that_failed). In this case I can only recommend a backup again.

In general: execute the following commands one line at a time and read the outputs! If you read some warnings about “keg-only” that’s fine, it just means that brew won’t “hide” your system’s stuff behind the stuff it installed itself so it doesn’t cause problems… brewed stuff will still use it.

# set up some taps and update brew
brew tap homebrew/science # a lot of cool formulae for scientific tools
brew tap homebrew/python # numpy, scipy, matplotlib, ...
brew update && brew upgrade

# install a brewed python
brew install python
# and/or if you want to give python3 a go (it's about time):
#brew install python3
# repeat/replace the following pip commands with pip3


A word about brewed python: this is what you want! It’s more up to date than the system python, it will come with pip and correctly install in the brew directories, working together well with brewed python libs that aren’t installable with plain pip. This also means that pip, like all of homebrew, will by default work without sudo, so if you ever run or have to run sudo pip ... because of missing permissions, then you’re doing it wrong! Also, don’t be afraid of multiple pythons on your system: you have them anyhow (python2 and python3) and it’s an advantage, as we’ll make sure that nothing poisons your system python and that you as a user & developer will use the brewed python:

hash -r python # invalidate bash's lookup cache for python
which python
# should say /usr/local/bin/python
echo $PATH
# /usr/local/bin should appear in front of /usr/bin


If this is not the case you’d probably end up not using brewed python. Please check your brew install with brew doctor, it will probably tell you that you should consider updating your paths in ~/.bashrc. You can either follow its directions or create a ~/.profile file like this one: ~/.profile. If you performed these steps, please close your terminal windows and open a new one for the changes to take effect. Test the above again.
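
If eyeballing a long $PATH is tedious, the same check can be sketched in a few lines of Python. The function name and its defaults are made up for illustration; it only compares the order of the two directories and knows nothing about brew itself.

```python
def brew_bin_comes_first(path_value, brew_bin="/usr/local/bin", system_bin="/usr/bin"):
    """True if brew_bin appears in the PATH string before system_bin."""
    # PATH entries are colon-separated on OS X; ignore empty entries.
    entries = [entry.rstrip("/") for entry in path_value.split(":") if entry]
    if brew_bin not in entries:
        return False
    if system_bin not in entries:
        return True
    return entries.index(brew_bin) < entries.index(system_bin)
```

Running it as brew_bin_comes_first(os.environ["PATH"]) (after import os) in the python you just started tells you whether a brewed python would win over the system one.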

Even if the above check worked, run the following anyhow and read through its output (no output is good):

brew doctor


Pay special attention if this tells you to install XQuartz, and if it does, install it! You’ll need it anyhow…

Now after these preparations, let’s really start installing stuff… below you’ll mostly find one package / lib per line. The packages and their options are recommendations that might save you some trouble, so I’d suggest installing all of them as written here, even if specifying some of the options makes brew compile packages from source, which takes a bit longer…

# install PIL, imagemagick, graphviz and other
# image generating stuff
brew install libtiff libjpeg webp little-cms2
pip install Pillow
brew install imagemagick --with-fftw --with-librsvg --with-x11
brew install graphviz --with-librsvg --with-x11
brew install cairo
brew install qt pyqt5

# install virtualenv, nose (unittests & doctests on steroids)
pip install virtualenv
pip install nose

# install numpy and scipy
# nowadays there are two good ways to install numpy and scipy: via pip or via brew.
# PICK ONE:
# - i prefer pip for proper virtualenv support and more up-to-date versions.
# - brew builds are a bit older, but handy if you need a source build
pip install numpy
pip install scipy
# OR:
# (if you want to run numpy and scipy with openblas also remove comments below:)
#brew install openblas
#brew install numpy # --with-openblas
#brew install scipy # --with-openblas

# test the numpy & scipy install
python -c 'import numpy ; numpy.test();'
python -c 'import scipy ; scipy.test();'

# some cool python libs (if you don't know them, look them up)
# matplotlib: generate plots
# pandas: time series stuff
# nltk: natural language toolkit
# sympy: symbolic maths in python
# q: fancy debugging output
# snakeviz: cool visualization of profiling output (aka what's taking so long?)
#brew install Caskroom/cask/mactex  # if you want to install matplotlib with tex support and don't have mactex installed already
brew install matplotlib --with-cairo --with-tex  # cairo: png ps pdf svg filetypes, tex: tex fonts & formula in plots
pip install pandas
pip install nltk
pip install sympy
pip install q
pip install snakeviz

# ipython/jupyter with parallel and notebook support
brew install zmq
pip install 'ipython[all]'  # quoted so the shell doesn't treat [all] as a glob pattern

# html stuff (parsing)
pip install html5lib cssselect pyquery lxml BeautifulSoup

# webapps / apis (choose what you like)

# semantic web stuff: rdf & sparql
pip install rdflib SPARQLWrapper

# graphs (graph metrics, social network analysis, layouting)
pip install networkx

# maintenance: updating of pip libs
pip list --outdated  # see Updating section below


Have fun 😉
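
After a long install session it's easy to miss that one of the packages silently failed. Here is a small Python 3 sketch that reports which modules can't be found. Note that it takes import names, which sometimes differ from the pip package names (Pillow is imported as PIL), and that the STACK list is just my selection from the packages above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names (not pip package names) for part of the stack installed above.
STACK = ["numpy", "scipy", "matplotlib", "pandas", "nltk", "sympy",
         "PIL", "lxml", "rdflib", "SPARQLWrapper", "networkx"]
```

An empty list from missing_modules(STACK) means everything is at least findable. It doesn't replace numpy.test() and friends, as it only checks that the modules exist, not that they work.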

## Updating

OK, let’s say it’s been a while since you installed things with this guide and you now want to update all the installed libs. To do this you should first upgrade everything installed with brew like this:

brew update && brew outdated && brew upgrade


Afterwards for upgrading pip packages (globally or in a virtualenv) you can just run

pip list --outdated


to get a list of outdated packages and then manually update them with:

pip install -U package1 package2 ...


If you want a tiny bit more comfort you can use the pip-review package to do this interactively:

pip install pip-review


Once installed you should be able to run the following either in a virtualenv or globally for your whole system:

pip-review -i # for interactive mode, -a to upgrade all which is dangerous


It will check your installed packages for new versions and give you a list of outdated packages. I’d recommend to run it with the -i option to interactively install the upgrades.
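
If you'd rather script the manual route, pip can print the outdated list as JSON with pip list --outdated --format=json. The following Python sketch turns that output into a single upgrade command. The function name and the skip parameter are made up, the idea being that you can exclude packages you installed via brew.

```python
import json
import shlex


def upgrade_command(outdated_json, skip=()):
    """Build a `pip install -U ...` command from pip's JSON list of outdated packages.

    Returns None if nothing (outside of skip) is outdated.
    """
    names = sorted(
        entry["name"] for entry in json.loads(outdated_json)
        if entry["name"] not in skip
    )
    if not names:
        return None
    return "pip install -U " + " ".join(shlex.quote(name) for name in names)
```

The sketch only builds the command string and doesn't run anything, so you can still read it before pasting it into your terminal.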

A word of warning about the brewed packages: If I recommended installing a package with brew above, that’s usually for a good reason, like the pip version not working properly. If you’re a bit more advanced, you can try to upgrade them with pip, but I’d recommend properly unlinking them with brew unlink <package> first, as some pip packages might run into problems otherwise. If you find the pip package works like a charm then, please let me know in the comments below so I can update this guide. In general I prefer the pip packages as they’re more up to date, work in virtual environments and can then easily be updated with pip-review.

2014-03-02: include checking of $PATH for Mike
2015-03-17: enhanced many explanations, provided some useful options for packages, general workover
2015-06-05: Pillow via pip and Updating section
2015-11-01: pip-review (was detached from pip-tools) + alternative
2016-02-15: hash -r python (invalidate bash python bin lookup cache)
2016-12-21: updates for sierra, brew upgrade, python3 and some more comments
2017-03-30: updated pyqt’s package name to pyqt5

# Scientific Python on Mac OS X 10.8 with homebrew

A step-by-step installation guide to setup a scientific python environment based on Mac OS X and homebrew.

Needless to say: Make a backup (Timemachine)

First install homebrew.
Follow their instructions, then come back here.

If you don’t have a clean install, some of the following steps might need minor additional attention (like changing permissions with chmod, chown, chgrp or overwriting existing files in the linking step with brew link --overwrite package_that_failed). In this case I can only recommend a backup again.

In general: execute the following commands one line at a time and read the outputs! If you read some warnings about “keg-only” that’s fine, it just means that brew won’t “hide” your system’s stuff behind the stuff it installed itself so it doesn’t cause problems… brewed stuff will still use it.

# set up some taps and update brew
brew tap homebrew/science # a lot of cool formulae for scientific tools
brew tap homebrew/python # numpy, scipy

# install a brewed python
brew install python

# install openblas (otherwise scipy's arpack tests will fail)
brew install openblas

# install PIL, imagemagick, graphviz and other
# image generating stuff (qt is nice for viewing)
brew install pillow imagemagick graphviz
brew install cairo --without-x
brew install qt pyqt

# install nose (unittests & doctests on steroids)
pip install virtualenv nose

# install numpy and scipy
brew install numpy --with-openblas # bug in Accelerate framework < Mac OS X 10.9
brew install scipy --with-openblas # bug in Accelerate framework < Mac OS X 10.9

# test the scipy install
brew test scipy

# some cool python libs (if you don't know them, look them up)
# time series stuff, natural language toolkit
# generate plots, symbolic maths in python, fancy debugging output
pip install pandas nltk matplotlib sympy q

# ipython and notebook support
brew install zmq
pip install ipython[zmq,qtconsole,notebook,test]

# html stuff (parsing)
pip install html5lib cssselect pyquery lxml BeautifulSoup

# webapps / apis (choose what you like)

# semantic web stuff: rdf & sparql
pip install rdflib SPARQLWrapper

# picloud (easily run python scripts in the cloud)
pip install cloud


Have fun 😉

update 2014-02-25: updated tap samueljohn/python to homebrew/python, new version linked

Recently, a post about Hacking the <a> tag in 100 characters made its rounds on the internet.
A short summary of this post is that it’s possible to change the link target (the href attribute) just after it’s clicked, but before the target is loaded. This means that hovering over a link and seeing a target such as http://paypal.com you might actually be redirected to some phishing page.

Here’s an easy js-fiddle demonstrating the issue which actually presents a popup before you are redirected (see the Result tab for the demo):

So you might think this is bad and maybe you’re right. In his post, Bilaw suggests that browsers should notify their users if the link target was changed between the click and following the link.

While this might seem legit, it’s actually not that simple: There are at least two alternatives to achieve the same linkception (not the SEO term) without changing the link target:

1. You can just “redirect” the click to another link. That other link can even be hidden.
2. You can stop the click event and just load another page.

Both are shown in the js-fiddle below:

So fixing this kind of phishing attack isn’t all that simple. You could argue that you shouldn’t be able to bind anything to click events on a-tags, but then there are some good use-cases for this (like you want to make the user confirm following a link).

As it’s quite simple to obfuscate the few examples I’ve shown with callbacks, timers, etc., I’m quite sure that fixing them wouldn’t be the end of the arms race. Maybe the only thing we can actually do is raise awareness.

PS: Depending on your browser the click event is triggered on different actions: Firefox only triggers it on left mouse-button click, Chrome also on middle mouse-button click. Both trigger the click event on CTRL/CMD+left mouse-button (opening the link in a new tab), and neither triggers it on right click + open in new tab.