Ever wondered what the top subjects / predicates / objects are in DBpedia?
I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD.
Turns out this is actually more difficult than I expected, mostly because quad stores don’t optimize for such queries. This means that you can’t just ask a SPARQL endpoint (not even your local one) to give you the top subjects, predicates or objects with a query like this:
select ?n (count(*) as ?c)
where {
?n ?p ?o.
}
group by ?n
order by desc(?c)
limit 10
Try it yourself here if you don’t believe me… (I set it to time out after 15 seconds, and it will return quite a dangerously nonsensical result if you’re not aware that you might get partial answers).
Some Rant
So this led me to the fascinating conclusion that our beloved RDF query language doesn’t even allow us to answer simple questions such as “which node is most often used as a subject / predicate / object?” (we’re talking with a single SPARQL endpoint here, don’t even try dragging me into an open/closed world assumption discussion, …).
So, all is great, let’s just not ask those evil questions…
… said no (computer) scientist ever.
So let’s get our hands dirty and use some unix tool magic…
Working with Dumps in NT Format
Luckily, I already had all the dumps laid out locally as described here, and lucky again, they are in N-Triples format.
N-Triples is a line based format, which means we have exactly one triple per line. I don’t exactly know whom to thank for this, but should you ever read this (wait, why are you reading my blog?) THANK YOU. It means that neither subject nor predicate nor object can contain (unescaped) newlines. And this means that you can actually quite sanely sort and parse .nt files with standard unix tools that have been optimized by generations of smart people.
I think you see where this is going: a good old bash one-liner with grep, cut, sort and uniq, by far the fastest tools I know for the job.
A Word about Sort Orders and Locales
Sort orders depend on your locale! This means that files sorted with a locale such as en_US.UTF-8 are not properly sorted for someone with a locale such as de_DE.UTF-8. Hence it’s wise to always run this in a shell before working with sort:
export LC_ALL=C
It resets your locale to a classic C byte-wise one, having the nice side effect that it’s faster as well.
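A minimal sketch of what the byte-wise order looks like, using four made-up lines. Only the C locale is shown here, since which other locales (en_US.UTF-8, de_DE.UTF-8, …) are installed depends on the machine:

```shell
# Four sample lines in mixed case; the data is made up for illustration.
sample=$(printf 'b\nA\na\nB\n')

# Byte-wise order: all uppercase ASCII letters sort before all lowercase ones.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"

# "sort -c" only checks the order and exits non-zero if the input is not
# sorted under the locale in effect, a cheap sanity check before uniq.
printf '%s\n' "$c_sorted" | LC_ALL=C sort -c && echo "sorted for LC_ALL=C"
```

Running the same check on a file that was sorted under a different locale is a quick way to find out whether it needs to be re-sorted.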
Deduplication
First, it turns out the DBpedia dumps actually contain quite an astonishing amount of duplicate triples. This is not a problem if loaded into a quad store as they’ll just count once, but for counting them like we will, it is a problem.
To split them apart let’s do the following: we pick up all the dump files that are loaded into our endpoint with pv, a handy little tool similar to cat, but it shows a nice progress bar. Then we decompress with zcat, remove comments from the files with grep and then call sort. We actually tell sort to use a ton of RAM (32 GB), but actually not even that is enough for the > 80 GB decompressed dumps, so we need temp files. We can direct sort to put them onto an SSD instead of just in /tmp as by default, and we can also compress those temp files on the fly. lzop is a very fast compression tool and the perfect fit for this (not compressing the files with this actually degrades performance even at 300 MB/s write speeds of the SSD!). After this we use tee to multiplex our stream into two channels: one plain uniq and gzipped with pigz (like gzip but parallel, as gzipping > 80 GB becomes quite the bottleneck here otherwise) into dbpedia_uniq.nt.gz and another invocation of uniq -c -d which only counts the duplicate lines and gzips (this is ok to be single threaded, as it’s not sooo big) them into dbpedia_dups.txt.gz.
pv /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/{dbpedia.org,ext.dbpedia.org,pagelinks.dbpedia.org,topicalconcepts.dbpedia.org}/* |
zcat | # decompress
grep -v -E '^\s*#' | # ignore comments in the nt files
sort -S32G -T/ssd/tmp/ --compress-program=lzop |
tee \
>( uniq | pigz > dbpedia_uniq.nt.gz ) \
>( uniq -c -d | gzip > dbpedia_dups.txt.gz ) \
>/dev/null
As you can see from the first line, I include the external, pagelinks and topicalconcepts datasets, but the process is really the same no matter what.
After ~ 10 minutes we’re left with a 6.5 GB dbpedia_uniq.nt.gz (547,084,682 unique triples) and a 238 MB dbpedia_dups.txt.gz.
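To see what the two uniq branches produce, here is a toy run on a handful of made-up example.org triples. It is only a sketch: it leaves out pv, zcat, the sort tuning flags and tee, and runs the two branches one after the other so the result is easy to inspect. The grep pattern uses [[:space:]] as a portable stand-in for \s.

```shell
export LC_ALL=C

# Toy N-Triples input (made-up example.org triples), one of them three times.
triples='<http://example.org/a> <http://example.org/p> <http://example.org/b> .
<http://example.org/c> <http://example.org/p> <http://example.org/d> .
<http://example.org/a> <http://example.org/p> <http://example.org/b> .
# a comment line
<http://example.org/a> <http://example.org/p> <http://example.org/b> .'

# Same first stages as the big pipeline: drop comment lines, then sort.
sorted=$(printf '%s\n' "$triples" | grep -v -E '^[[:space:]]*#' | sort)

# Branch 1: every distinct triple exactly once.
uniq_triples=$(printf '%s\n' "$sorted" | uniq)

# Branch 2: only triples that occur more than once, prefixed with their count.
dups=$(printf '%s\n' "$sorted" | uniq -c -d)

printf '%s\n' "$uniq_triples" | wc -l   # number of distinct triples
printf '%s\n' "$dups"                   # duplicated triples with counts
```

Note that uniq only compares adjacent lines, which is why the sort has to come first.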
Top Duplicates
The top duplicates as acquired with
zcat dbpedia_dups.txt.gz | sort -n -r -S8G | head -n20
are (full file (238 MB)):
4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> .
4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> .
1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> .
1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> .
1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> .
1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> .
914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
885 <http://dbpedia.org/resource/List_of_Tachinidae_genera> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/List_of_Tachinidae_genera> .
784 <http://commons.wikimedia.org/wiki/Special:FilePath/Illinois_-_outline_map.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Illinois_-_outline_map.svg> .
Getting S,P,O Counts
OK, now let’s count the subject, predicate and object occurrences.
Subjects, predicates and objects are delimited by a single space (“ ”); everything else in the line we just count as part of the object (so the final “ .” is counted as part of the object, too).
Similar to the above pipeline, we use tee again to multiplex the stream into three pipelines for subject, predicate and object counts.
Each of them is mostly based on cut, first to get the fields (-f1 for subject, -f2 for predicate, -f3- for object), then to limit very long strings to only the first 1024 chars. While this actually introduces some false positive matches for long literals, it’s probably safe for URIs, and it reduces sort times and file sizes for the object chunk a lot. If you want very accurate counts you should probably re-run without the cut -c-1024 lines.
Afterwards in each pipeline the occurrences of a node in the s,p,o positions are sorted and counted with uniq -c, then gzipped with pigz.
pv dbpedia_uniq.nt.gz |
zcat |
tee \
>( cut -f1 -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c |
pigz > dbpedia_1_subject_counts.txt.gz ) \
>( cut -f2 -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c |
pigz > dbpedia_2_predicate_counts.txt.gz ) \
>( cut -f3- -d' ' |
cut -c-1024 |
sort -S16G -T/ssd/tmp/ --compress-program=lzop |
uniq -c | pigz > dbpedia_3_object_counts.txt.gz ) \
>/dev/null
After 15 minutes we’re left with 3 files:
dbpedia_1_subject_counts.txt.gz (214M), dbpedia_2_predicate_counts.txt.gz (387K), dbpedia_3_object_counts.txt.gz (1.9G)
As expected there are only relatively few different predicates and the objects actually take up quite a lot of data.
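The field splitting is easy to try on a toy sample. The sketch below uses three made-up example.org triples and runs the three cut variants one after the other instead of via tee. The last part illustrates the truncation caveat with a cut-off of 10 characters instead of 1024, just to keep the example short.

```shell
export LC_ALL=C

# Toy input (made-up example.org triples); one object is a literal with spaces.
triples='<http://example.org/a> <http://example.org/p> <http://example.org/b> .
<http://example.org/a> <http://example.org/q> "hello big world"@en .
<http://example.org/c> <http://example.org/p> <http://example.org/b> .'

# Field 1 is the subject, field 2 the predicate.
subject_counts=$(printf '%s\n' "$triples" | cut -f1 -d' ' | sort | uniq -c)
predicate_counts=$(printf '%s\n' "$triples" | cut -f2 -d' ' | sort | uniq -c)

# "-f3-" keeps everything from the third field on, so literals with spaces
# stay in one piece (and the trailing " ." stays attached to the object).
object_counts=$(printf '%s\n' "$triples" | cut -f3- -d' ' | sort | uniq -c)

printf 'subjects:\n%s\n' "$subject_counts"
printf 'predicates:\n%s\n' "$predicate_counts"
printf 'objects:\n%s\n' "$object_counts"

# Truncation caveat: two different literals that share their first 10
# characters are counted as one value after "cut -c-10".
truncated=$(printf '"aaaaaaaaaX"\n"aaaaaaaaaY"\n' | cut -c-10 | sort | uniq -c)
printf 'truncated:\n%s\n' "$truncated"
```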
Before getting the tops it’s quite useful to exclude subjects and objects that occur fewer than 10 times with awk, which greatly reduces the file sizes and subsequent sort times:
zcat dbpedia_1_subject_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_1_subject_counts_o9.txt.gz
zcat dbpedia_3_object_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_3_object_counts_o9.txt.gz
dbpedia_1_subject_counts_o9.txt.gz (89M), dbpedia_3_object_counts_o9.txt.gz (30M).
As we can see from the size reduction already, there are actually way more objects occurring fewer than 10 times than subjects.
Similar to before the tops can be acquired with something like this:
for f in dbpedia_1_subject_counts_o9.txt.gz dbpedia_2_predicate_counts.txt.gz dbpedia_3_object_counts_o9.txt.gz ; do
zcat "$f" | sort -n -r | pigz > "${f%.txt.gz}_tops.txt.gz"
done
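Here is the threshold filter and the numeric sort on a few made-up count lines, as a small sketch without the gzip steps. The counts and example.org URIs are invented for illustration:

```shell
# Toy "uniq -c" style output: a count followed by a node (made-up URIs).
counts='      3 <http://example.org/rare>
     12 <http://example.org/common>
    250 <http://example.org/hub>
     10 <http://example.org/borderline>'

# Keep only nodes that occur at least 10 times ($1 is the count column).
frequent=$(printf '%s\n' "$counts" | awk '$1 > 9 { print }')

# Numeric, descending sort puts the most frequent node first.
tops=$(printf '%s\n' "$frequent" | sort -n -r)
printf '%s\n' "$tops"

# The ${f%.txt.gz} expansion strips the suffix so "_tops" can be appended.
f=dbpedia_1_subject_counts_o9.txt.gz
tops_name="${f%.txt.gz}_tops.txt.gz"
echo "$tops_name"
```

The filter runs before the sort on purpose: dropping the long tail first means sort has far fewer lines to handle.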
So here they are, the …
Top 100 Subjects:
dbpedia_1_subject_counts_o9_tops.txt.gz (95M)
8118 <http://dbpedia.org/resource/Alphabetical_list_of_communes_of_Italy>
7110 <http://dbpedia.org/resource/List_of_places_in_Afghanistan>
6162 <http://dbpedia.org/resource/Index_of_Andhra_Pradesh-related_articles>
5857 <http://dbpedia.org/resource/List_of_populated_places_in_Bosnia_and_Herzegovina>
5712 <http://dbpedia.org/resource/2013_in_film>
5550 <http://dbpedia.org/resource/List_of_municipalities_of_Brazil>
5458 <http://dbpedia.org/resource/List_of_dialling_codes_in_Germany>
5405 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Plantae)>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_3_of_4>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_2_of_4>
5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_1_of_4>
5182 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Animalia)>
5152 <http://dbpedia.org/resource/Index_of_India-related_articles>
5090 <http://dbpedia.org/resource/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States>
5068 <http://dbpedia.org/resource/List_of_Social_Democratic_Party_of_Germany_members>
4942 <http://dbpedia.org/resource/List_of_painters_in_the_Web_Gallery_of_Art>
4873 <http://dbpedia.org/resource/List_of_stage_names>
4829 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_4_of_4>
4795 <http://dbpedia.org/resource/List_of_Harvard_University_people>
4743 <http://dbpedia.org/resource/List_of_OMIM_disorder_codes>
4726 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia>
4698 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia_(alphabetic)>
4690 <http://dbpedia.org/resource/Index_of_philosophy_articles_(I%E2%80%93Q)>
4603 <http://dbpedia.org/resource/List_of_molluscan_genera_represented_in_the_fossil_record>
4493 <http://dbpedia.org/resource/List_of_American_television_programs_by_date>
4457 <http://dbpedia.org/resource/List_of_biographical_films>
4443 <http://dbpedia.org/resource/List_of_brachiopod_genera>
4355 <http://dbpedia.org/resource/List_of_English_writers>
4345 <http://dbpedia.org/resource/List_of_composers_by_name>
4341 <http://dbpedia.org/resource/List_of_historical_German_and_Czech_names_for_places_in_the_Czech_Republic>
4329 <http://dbpedia.org/resource/2012_in_film>
4275 <http://dbpedia.org/resource/List_of_people_from_Illinois>
4219 <http://dbpedia.org/resource/List_of_people_from_Texas>
4218 <http://dbpedia.org/resource/List_of_village_development_committees_of_Nepal>
4194 <http://dbpedia.org/resource/List_of_postal_codes_in_Portugal>
4159 <http://dbpedia.org/resource/IUCN_Red_List_data_deficient_species_(Chordata)>
4140 <http://dbpedia.org/resource/List_of_trilobite_genera>
4137 <http://dbpedia.org/resource/List_of_aircraft_engines>
4130 <http://dbpedia.org/resource/List_of_moths_of_Taiwan>
4084 <http://dbpedia.org/resource/List_of_flora_of_the_Sonoran_Desert_Region_by_common_name>
3992 <http://dbpedia.org/resource/List_of_film_score_composers>
3984 <http://dbpedia.org/resource/List_of_marine_gastropod_genera_in_the_fossil_record>
3930 <http://dbpedia.org/resource/List_of_performances_on_Top_of_the_Pops>
3886 <http://dbpedia.org/resource/List_of_gliders>
3873 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Romania>
3839 <http://dbpedia.org/resource/List_of_20th-century_classical_composers>
3768 <http://dbpedia.org/resource/Rosters_of_the_top_basketball_teams_in_European_club_competitions>
3740 <http://dbpedia.org/resource/List_of_airports_by_ICAO_code:_K>
3705 <http://dbpedia.org/resource/List_of_United_States_counties_and_county_equivalents>
3659 <http://dbpedia.org/resource/List_of_Russian_people>
3646 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Germany>
3615 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Switzerland>
3597 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Slovakia>
3589 <http://dbpedia.org/resource/List_of_protected_areas_of_China>
3583 <http://dbpedia.org/resource/List_of_Advanced_Dungeons_&_Dragons_2nd_edition_monsters>
3541 <http://dbpedia.org/resource/Index_of_U.S._counties>
3502 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Hungary>
3499 <http://dbpedia.org/resource/The_opera_corpus>
3466 <http://dbpedia.org/resource/List_of_German_Christian_Democratic_Union_politicians>
3466 <http://dbpedia.org/resource/2012%E2%80%9313_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3439 <http://dbpedia.org/resource/List_of_viruses>
3439 <http://dbpedia.org/resource/Google_Street_View_in_the_United_States>
3432 <http://dbpedia.org/resource/2013%E2%80%9314_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3430 <http://dbpedia.org/resource/List_of_Lepidoptera_of_the_Czech_Republic>
3392 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Greece>
3378 <http://dbpedia.org/resource/List_of_surnames_in_Russia>
3378 <http://dbpedia.org/resource/List_of_film_director_and_actor_collaborations>
3377 <http://dbpedia.org/resource/2010%E2%80%9311_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3342 <http://dbpedia.org/resource/2009%E2%80%9310_UEFA_Europa_League_qualifying_phase_and_play-off_round>
3327 <http://dbpedia.org/resource/Index_of_World_War_II_articles_(U)>
3321 <http://dbpedia.org/resource/List_of_moths_of_Madagascar>
3295 <http://dbpedia.org/resource/List_of_country_houses_in_the_United_Kingdom>
3277 <http://dbpedia.org/resource/List_of_counties_by_U.S._state>
3273 <http://dbpedia.org/resource/List_of_licensed_and_localized_editions_of_Monopoly:_Europe>
3255 <http://dbpedia.org/resource/List_of_moths_of_North_America_(MONA_8322-11233)>
3254 <http://dbpedia.org/resource/List_of_local_administrative_units_of_Romania>
3236 <http://dbpedia.org/resource/Catalog_of_paintings_in_the_National_Gallery,_London>
3233 <http://dbpedia.org/resource/IUCN_Red_List_endangered_species_(Animalia)>
3232 <http://dbpedia.org/resource/IUCN_Red_List_near_threatened_species_(Animalia)>
3209 <http://dbpedia.org/resource/List_of_Chopped_episodes>
3201 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Poland>
3200 <http://dbpedia.org/resource/List_of_directorial_debuts>
3192 <http://dbpedia.org/resource/List_of_postal_codes_in_Germany>
3175 <http://dbpedia.org/resource/2010_in_film>
3163 <http://dbpedia.org/resource/Index_of_philosophy_articles_(R%E2%80%93Z)>
3156 <http://dbpedia.org/resource/List_of_bannered_U.S._Routes>
3136 <http://dbpedia.org/resource/Timeline_of_Google_Street_View>
3135 <http://dbpedia.org/resource/Index_of_Byzantine_Empire-related_articles>
3129 <http://dbpedia.org/resource/Index_of_Singapore-related_articles>
3114 <http://dbpedia.org/resource/List_of_postal_codes_of_Switzerland>
3107 <http://dbpedia.org/resource/2009_in_film>
3088 <http://dbpedia.org/resource/List_of_University_of_Pennsylvania_people>
3071 <http://dbpedia.org/resource/List_of_children's_television_series_by_country>
3065 <http://dbpedia.org/resource/List_of_populated_places_in_the_Netherlands>
3044 <http://dbpedia.org/resource/List_of_ZX_Spectrum_games>
3039 <http://dbpedia.org/resource/October_2011_in_sports>
3022 <http://dbpedia.org/resource/List_of_flora_of_Ohio>
3019 <http://dbpedia.org/resource/List_of_PlayStation_2_games>
3016 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Bulgaria>
3005 <http://dbpedia.org/resource/List_of_voice_actors>
Observations:
The top subjects are clearly dominated by list-like resources. Very big “normal” articles such as those of countries like dbpedia:United_States (1375 occurrences as subject) or dbpedia:Germany (1331 occurrences as subject) can only be found below ranks of 1518 or 1673. Scrolling through the top subject counts, the share of “List” vs. non-“List” resources seems to slowly even out around 1000 occurrences (rank 3800+), but even among subjects that “only” occur ~500 times (rank 21000+) about 1/4 still seem to be “Lists”.
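Ranks like these can be read off the sorted tops file, where the line number is the rank. The following is a sketch of one way to do it on a three-line stand-in. The two country counts are the ones quoted above, the list entry and the helper name rank_of are made up, and matching “List_of_” is only a rough heuristic for list-like resources:

```shell
# Toy stand-in for the sorted tops file: count, then subject. The list entry
# is made up; the two country counts are taken from the text.
tops='   2000 <http://example.org/List_of_things>
   1375 <http://dbpedia.org/resource/United_States>
   1331 <http://dbpedia.org/resource/Germany>'

# In a file sorted by descending count, the line number is the rank.
# rank_of is a hypothetical helper name, not part of any tool.
rank_of() {
    printf '%s\n' "$tops" | awk -v node="$1" '$2 == node { print NR; exit }'
}

rank_of '<http://dbpedia.org/resource/United_States>'
rank_of '<http://dbpedia.org/resource/Germany>'

# Rough share of list-like resources: entries whose URI contains "List_of_".
list_share=$(printf '%s\n' "$tops" |
    awk '$2 ~ /List_of_/ { n++ } END { printf "%d/%d\n", n, NR }')
echo "$list_share"
```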
Top 100 Predicates:
dbpedia_2_predicate_counts_tops.txt.gz (384K)
149707899 <http://dbpedia.org/ontology/wikiPageWikiLink>
86391520 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
33958849 <http://www.w3.org/2002/07/owl#sameAs>
18731754 <http://purl.org/dc/terms/subject>
13926391 <http://www.w3.org/2000/01/rdf-schema#label>
13494896 <http://dbpedia.org/ontology/wikiPageRevisionID>
13494875 <http://www.w3.org/ns/prov#wasDerivedFrom>
13494819 <http://dbpedia.org/ontology/wikiPageID>
10948106 <http://dbpedia.org/ontology/wikiPageOutDegree>
10948106 <http://dbpedia.org/ontology/wikiPageLength>
10948086 <http://xmlns.com/foaf/0.1/primaryTopic>
10948086 <http://xmlns.com/foaf/0.1/isPrimaryTopicOf>
10948086 <http://purl.org/dc/elements/1.1/language>
7081593 <http://dbpedia.org/ontology/wikiPageExternalLink>
6473988 <http://dbpedia.org/ontology/wikiPageRedirects>
5926272 <http://dbpedia.org/ontology/abstract>
5925778 <http://www.w3.org/2000/01/rdf-schema#comment>
4267352 <http://xmlns.com/foaf/0.1/name>
4041585 <http://dbpedia.org/property/hasPhotoCollection>
3781737 <http://dbpedia.org/property/name>
2342002 <http://purl.org/dc/elements/1.1/rights>
2268299 <http://www.w3.org/2004/02/skos/core#broader>
2084717 <http://purl.org/dc/elements/1.1/description>
1514496 <http://dbpedia.org/ontology/team>
1374565 <http://xmlns.com/foaf/0.1/depiction>
1374185 <http://dbpedia.org/ontology/thumbnail>
1363398 <http://dbpedia.org/ontology/wikiPageDisambiguates>
1289141 <http://dbpedia.org/property/title>
1231780 <http://dbpedia.org/property/subdivisionType>
1171004 <http://xmlns.com/foaf/0.1/thumbnail>
1122598 <http://www.w3.org/2004/02/skos/core#prefLabel>
1080114 <http://xmlns.com/foaf/0.1/givenName>
1058532 <http://www.georss.org/georss/point>
1052578 <http://dbpedia.org/property/shortDescription>
1052115 <http://xmlns.com/foaf/0.1/surname>
1005079 <http://dbpedia.org/ontology/birthPlace>
995639 <http://dbpedia.org/ontology/birthDate>
983813 <http://dbpedia.org/property/subdivisionName>
973597 <http://dbpedia.org/ontology/birthYear>
968085 <http://dbpedia.org/property/dateOfBirth>
907869 <http://www.w3.org/2003/01/geo/wgs84_pos#lat>
906919 <http://www.w3.org/2003/01/geo/wgs84_pos#long>
861765 <http://dbpedia.org/property/goals>
846283 <http://dbpedia.org/property/placeOfBirth>
846182 <http://dbpedia.org/ontology/isPartOf>
838381 <http://dbpedia.org/property/birthPlace>
826348 <http://dbpedia.org/property/years>
656559 <http://dbpedia.org/property/length>
653929 <http://dbpedia.org/property/date>
649375 <http://xmlns.com/foaf/0.1/homepage>
643162 <http://dbpedia.org/ontology/careerStation>
641528 <http://dbpedia.org/ontology/years>
574296 <http://dbpedia.org/property/birthDate>
556627 <http://dbpedia.org/property/genre>
553122 <http://dbpedia.org/ontology/country>
539366 <http://dbpedia.org/property/clubs>
529649 <http://dbpedia.org/property/location>
525787 <http://dbpedia.org/property/rd1Team>
512507 <http://dbpedia.org/ontology/numberOfGoals>
501875 <http://dbpedia.org/ontology/genre>
492028 <http://dbpedia.org/ontology/numberOfMatches>
453911 <http://dbpedia.org/ontology/deathDate>
449759 <http://dbpedia.org/ontology/deathYear>
448696 <http://dbpedia.org/property/dateOfDeath>
448362 <http://www.w3.org/2002/07/owl#equivalentClass>
446799 <http://dbpedia.org/property/caption>
446238 <http://www.w3.org/2000/01/rdf-schema#subClassOf>
440121 <http://dbpedia.org/property/votes>
437797 <http://dbpedia.org/property/wordnet_type>
435648 <http://dbpedia.org/property/type>
431262 <http://dbpedia.org/property/caps>
418391 <http://dbpedia.org/ontology/utcOffset>
362378 <http://dbpedia.org/property/percentage>
362327 <http://dbpedia.org/ontology/type>
355814 <http://dbpedia.org/property/country>
346584 <http://dbpedia.org/property/candidate>
340788 <http://dbpedia.org/property/starring>
338307 <http://dbpedia.org/ontology/location>
327879 <http://dbpedia.org/ontology/family>
326730 <http://dbpedia.org/property/longew>
326699 <http://dbpedia.org/property/latns>
315041 <http://dbpedia.org/property/writer>
314566 <http://dbpedia.org/ontology/starring>
312958 <http://dbpedia.org/property/label>
310020 <http://dbpedia.org/property/rd2Team>
306816 <http://dbpedia.org/property/settlementType>
306271 <http://dbpedia.org/property/longd>
306246 <http://dbpedia.org/property/latd>
306195 <http://dbpedia.org/ontology/populationTotal>
293237 <http://dbpedia.org/property/team>
283282 <http://dbpedia.org/property/producer>
279814 <http://dbpedia.org/ontology/occupation>
278108 <http://dbpedia.org/ontology/order>
277006 <http://dbpedia.org/property/episodenumber>
275430 <http://dbpedia.org/property/longm>
275416 <http://dbpedia.org/property/latm>
272562 <http://dbpedia.org/ontology/deathPlace>
267256 <http://dbpedia.org/ontology/class>
265454 <http://dbpedia.org/property/timezone>
264081 <http://dbpedia.org/ontology/viafId>
Observations:
The predicates are clearly dominated by dbpedia-owl:wikiPageWikiLink and rdf:type relations.
What’s a bit surprising to me is that dcterms:subject occurs less often than rdf:type, but my guess is that it’s probably due to YAGO and also hierarchy materialization (an Athlete is also a Person). There’s a slight mismatch between the counts of dbpedia-owl:wikiPageRevisionID and prov:wasDerivedFrom. There are more dbpedia-owl:abstracts than rdfs:comments and more geo:lats than geo:longs.
Top 100 Objects:
dbpedia_3_object_counts_o9_tops.txt.gz (33M)
10948086 <http://xmlns.com/foaf/0.1/Document> .
10948086 "en"^^<http://www.w3.org/2001/XMLSchema#string> .
6239553 "1"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
2250659 <http://dbpedia.org/class/yago/PhysicalEntity100001930> .
2169386 <http://dbpedia.org/class/yago/Object100002684> .
2155200 <http://www.w3.org/2002/07/owl#Thing> .
1974654 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Agent> .
1974654 <http://dbpedia.org/ontology/Agent> .
1816213 <http://dbpedia.org/class/yago/YagoLegalActorGeo> .
1650316 <http://xmlns.com/foaf/0.1/Person> .
1649647 <http://wikidata.dbpedia.org/resource/Q5> .
1649647 <http://wikidata.dbpedia.org/resource/Q215627> .
1649647 <http://schema.org/Person> .
1649646 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#NaturalPerson> .
1649646 <http://dbpedia.org/ontology/Person> .
1621660 <http://dbpedia.org/class/yago/Whole100003553> .
1318799 <http://dbpedia.org/resource/Category:Living_people> .
1290718 <http://dbpedia.org/class/yago/YagoLegalActor> .
1257968 <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> .
1192248 <http://www.w3.org/2004/02/skos/core#Concept> .
1090313 <http://dbpedia.org/class/yago/LivingThing100004258> .
1090140 <http://dbpedia.org/class/yago/Organism100004475> .
1046726 <http://dbpedia.org/class/yago/Person100007846> .
1020287 <http://dbpedia.org/class/yago/CausalAgent100007347> .
868376 <http://dbpedia.org/resource/United_States> .
816854 <http://www.ontologydesignpatterns.org/ont/d0.owl#Location> .
816837 <http://schema.org/Place> .
816837 <http://dbpedia.org/ontology/Wikidata:Q532> .
816837 <http://dbpedia.org/ontology/Place> .
814269 <http://dbpedia.org/class/yago/YagoGeoEntity> .
726965 <http://dbpedia.org/class/yago/Abstraction100002137> .
658562 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Situation> .
643162 <http://dbpedia.org/ontology/CareerStation> .
561841 <http://dbpedia.org/class/yago/LivingPeople> .
547827 <http://dbpedia.org/ontology/PopulatedPlace> .
547037 "0"^^<http://www.w3.org/2001/XMLSchema#integer> .
539993 "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
531929 <http://www.opengis.net/gml/_Feature> .
528794 <http://dbpedia.org/class/yago/Artifact100021939> .
526256 <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing> .
524742 <http://dbpedia.org/class/yago/Location100027167> .
505425 <http://dbpedia.org/class/yago/Region108630985> .
476724 "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
469006 <http://dbpedia.org/ontology/Settlement> .
438713 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#InformationEntity> .
425044 <http://schema.org/CreativeWork> .
425044 <http://dbpedia.org/ontology/Work> .
419234 <http://dbpedia.org/resource/List_of_sovereign_states> .
401287 "N"@en .
400317 <http://dbpedia.org/resource/Animal> .
377252 "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
358891 <http://dbpedia.org/class/yago/GeographicalArea108574314> .
350209 <http://dbpedia.org/class/yago/District108552138> .
347718 <http://dbpedia.org/class/yago/Group100031264> .
336091 <http://dbpedia.org/ontology/Athlete> .
335320 "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
321614 <http://dbpedia.org/class/yago/AdministrativeDistrict108491826> .
313062 <http://dbpedia.org/class/yago/SocialGroup107950920> .
302658 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#SocialPerson> .
302658 <http://schema.org/Organization> .
302658 <http://dbpedia.org/ontology/Organisation> .
292395 "28"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
288074 "29"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
287957 "27"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
286256 "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
283702 <http://dbpedia.org/class/yago/Organization108008335> .
279633 "yes"@en .
279134 <http://dbpedia.org/ontology/SportsTeamMember> .
279134 <http://dbpedia.org/ontology/OrganisationMember> .
277439 "30"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
277025 "26"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
276000 "6"^^<http://www.w3.org/2001/XMLSchema#integer> .
264578 "31"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
263773 <http://dbpedia.org/class/yago/Contestant109613191> .
261435 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Organism> .
261435 <http://dbpedia.org/ontology/Species> .
260007 "E"@en .
256474 <http://dbpedia.org/ontology/Eukaryote> .
255993 "25"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
252172 <http://dbpedia.org/resource/England> .
249675 "32"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
240706 <http://dbpedia.org/resource/Iran_Standard_Time> .
234043 "24"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
231236 "33"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
228539 "7"^^<http://www.w3.org/2001/XMLSchema#integer> .
221419 "0".
219180 <http://dbpedia.org/class/yago/Player110439851> .
218573 "23"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
218297 <http://dbpedia.org/class/yago/PsychologicalFeature100023100> .
217297 "34"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
215320 <http://dbpedia.org/class/yago/Athlete109820263> .
214359 "18"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
208654 <http://dbpedia.org/class/yago/Tract108673395> .
208383 <http://dbpedia.org/resource/Arthropod> .
206076 "22"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
205888 "8"^^<http://www.w3.org/2001/XMLSchema#integer> .
204693 <http://dbpedia.org/resource/Lepidoptera> .
204220 "21"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
203472 <http://dbpedia.org/class/yago/Instrumentality103575240> .
202270 "35"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
Observations:
The top of the object counts is dominated, with an order of magnitude difference, by foaf:Document and “en”. The non-negative “1” follows, an order of magnitude ahead of the normal “0” and “1” 😉 In between, a lot of very useful types follow, and we can see that we have a lot of information about physical things, people, concepts and places. It’s also nice to see http://wikidata.dbpedia.org/resource/Q5 right under foaf:Person, even though the URI doesn’t resolve anymore(?) 🙁
The first “real” “A-Box” resource is dbpedia:United_States, followed by dbpedia:Animal, dbpedia:England, dbpedia:Iran_Standard_Time, dbpedia:Arthropod, dbpedia:Lepidoptera, dbpedia:Canada, dbpedia:Insect, dbpedia:France, dbpedia:United_Kingdom, dbpedia:India, dbpedia:Germany. In general it seems as if apart from ontology types many instances of types country, biological genus and city occur very often as objects.
The top literals seem to be numbers, especially years and single letters.
Conclusion
We’ve seen that it’s sadly not possible to get basic top-degree-counts for big datasets via SPARQL, as the endpoints don’t seem to be optimized for these kinds of queries. I hope this changes in the future as it’s quite useful to know degree distributions for all kinds of queries. Especially in the machine learning sector it seems quite essential to know if you’re dealing with a “normal” node or one of the exceptional top nodes that is several orders of magnitude bigger than the rest.
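For what it’s worth, a degree distribution can be derived from count files like the ones above with one more awk step. This is only a sketch on made-up numbers, and it treats the occurrence count in one position (e.g. as subject) as the degree:

```shell
# Toy "uniq -c" output: occurrence count per node (made-up example.org URIs).
counts='      1 <http://example.org/n1>
      1 <http://example.org/n2>
      1 <http://example.org/n3>
      2 <http://example.org/n4>
      2 <http://example.org/n5>
    500 <http://example.org/hub>'

# Count how many nodes share each occurrence count.
# Output columns: degree, number of nodes with that degree.
histogram=$(printf '%s\n' "$counts" |
    awk '{ nodes[$1]++ } END { for (d in nodes) print d, nodes[d] }' |
    sort -n)
printf '%s\n' "$histogram"
```

On the real files the input would be the zcat output of the counts files. Keep in mind that the _o9 variants have already lost everything below 10 occurrences.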
Hope you enjoyed. Feedback welcome, as always.
Further reading
Thanks for all the feedback I got on this post. There are somewhat similar works that you might be interested in:
- Dinesh Reddy, Magnus Knuth, Harald Sack: DBpedia GraphMeasures based on the wikiPageWikiLink property, computes PageRank, HITS, Inlink and Outlink degree for each Wikipedia Article. (These datasets are actually loaded on http://dbpedia.org/sparql, just not in the default graph http://dbpedia.org.)
- http://dbtrends.aksw.org/ calculates some stats similar to this post (and some more), but at the moment sadly only for DBpedia 3.9 and without stats about Literals and http://dbpedia.org/resource/Category:* resources.
Updates:
- 2015-02-01: Further reading