{"id":643,"date":"2015-01-28T17:33:01","date_gmt":"2015-01-28T16:33:01","guid":{"rendered":"https:\/\/joernhees.de\/blog\/?p=643"},"modified":"2016-09-28T22:45:50","modified_gmt":"2016-09-28T20:45:50","slug":"dbpedia-2014-stats-top-subjects-predicates-and-objects","status":"publish","type":"post","link":"https:\/\/joernhees.de\/blog\/2015\/01\/28\/dbpedia-2014-stats-top-subjects-predicates-and-objects\/","title":{"rendered":"DBpedia 2014 Stats &#8211; Top Subjects, Predicates and Objects"},"content":{"rendered":"<p>Ever wondered what the top subjects \/ predicates \/ objects are in DBpedia?<\/p>\n<p>I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD.<\/p>\n<p>Turns out this is actually more difficult than I expected, mostly because quad stores don&#8217;t optimize for such queries. This means that you can&#8217;t just ask a SPARQL endpoint (not even <a href=\"https:\/\/joernhees.de\/blog\/2014\/11\/10\/setting-up-a-local-dbpedia-2014-mirror-with-virtuoso-7-1-0\/\" title=\"Setting up a local DBpedia 2014 mirror with Virtuoso 7.1.0\" target=\"_blank\">your local one<\/a>) to give you the top subjects, predicates or objects with a query like this:<\/p>\n<pre><code class=\"sql\">select ?n count(*) as ?c\r\nwhere {\r\n  ?n ?p ?o.\r\n}\r\norder by desc(?c)\r\nlimit 10\r\n<\/code><\/pre>\n<p>Try it yourself <a href=\"http:\/\/dbpedia.org\/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&#038;qtxt=SELECT+%3Fn+COUNT%28*%29+AS+%3Fc%0D%0AWHERE+{%0D%0A++%3Fn+%3Fp+%3Fo.%0D%0A}%0D%0AORDER+BY+DESC%28%3Fc%29%0D%0ALIMIT+10&#038;format=text%2Fhtml&#038;timeout=15000&#038;debug=on\" target=\"_blank\">here<\/a> if you don&#8217;t believe me&#8230; (I set it to time out after 15 seconds, and it will return a dangerously nonsensical result if you&#8217;re not aware that you might <a href=\"https:\/\/github.com\/openlink\/virtuoso-opensource\/issues\/112\" target=\"_blank\">get 
partial answers<\/a>).<\/p>\n<h3>Some Rant<\/h3>\n<p>So this led me to the fascinating conclusion that our beloved RDF query language doesn&#8217;t even allow us to answer simple questions such as &#8220;which node is most often used as a subject \/ predicate \/ object?&#8221; (we&#8217;re talking to a single SPARQL endpoint here, don&#8217;t even try dragging me into an open\/closed world assumption discussion, &#8230;).<\/p>\n<p>So, all is great, let&#8217;s just not ask those evil questions&#8230;<\/p>\n<p>&#8230; said no (computer) scientist ever.<\/p>\n<p>So let&#8217;s get our hands dirty and use some unix tool magic&#8230;<\/p>\n<h2>Working with Dumps in NT Format<\/h2>\n<p>Luckily, I already had all the dumps laid out locally as described <a href=\"https:\/\/joernhees.de\/blog\/2014\/11\/10\/setting-up-a-local-dbpedia-2014-mirror-with-virtuoso-7-1-0\/\" title=\"Setting up a local DBpedia 2014 mirror with Virtuoso 7.1.0\" target=\"_blank\">here<\/a>, and lucky again, they are in N-Triples format.<\/p>\n<p><a href=\"http:\/\/www.w3.org\/TR\/n-triples\/\" target=\"_blank\">N-Triples<\/a> is a line-based format, which means we have exactly one triple per line. I don&#8217;t exactly know whom to thank for this, but should you ever read this (wait, why are you reading my blog?) THANK YOU. It means that neither subject nor predicate nor object can contain (unescaped) newlines. And this means that you can actually quite sanely sort and parse .nt files with standard unix tools that have been optimized by generations of smart people.<\/p>\n<p>I think you see where this is going: a good old bash one-liner with grep, cut, sort and uniq, by far <a href=\"http:\/\/aadrake.com\/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html\" target=\"_blank\">the fastest tools I know for the job<\/a>.<\/p>\n<h2>A Word about Sort Orders and Locales<\/h2>\n<p>Sort orders depend on your locale! 
This means that files sorted with a locale such as en_US.UTF-8 are not properly sorted for someone with a locale such as de_DE.UTF-8. Hence it&#8217;s wise to always run this in a shell before working with sort:<\/p>\n<pre><code class=\"bash\">export LC_ALL=C\r\n<\/code><\/pre>\n<p>It resets your locale to the classic C locale with byte-wise comparison, which has the nice side effect that sorting is faster as well.<\/p>\n<h2>Deduplication<\/h2>\n<p>First, it turns out the DBpedia dumps contain an astonishing number of duplicate triples. This is not a problem when they are loaded into a quad store, as each triple just counts once, but it is a problem for counting them the way we will.<\/p>\n<p>To split them apart, let&#8217;s do the following: we read all the dump files that are loaded into our endpoint with <code>pv<\/code>, a handy little tool similar to <code>cat<\/code> that also shows a nice progress bar. Then we decompress with <code>zcat<\/code>, remove comments from the files with <code>grep<\/code> and then call <code>sort<\/code>. We tell sort to use a ton of RAM (32 GB), but not even that is enough for the &gt; 80 GB of decompressed dumps, so we need temp files. We can direct sort to put them onto an SSD instead of the default <code>\/tmp<\/code>, and we can also compress those temp files on the fly. <code>lzop<\/code> is a very fast compression tool and the perfect fit for this (skipping this compression actually degrades performance, even at the SSD&#8217;s 300 MB\/s write speed!). 
After this we use <code>tee<\/code> to multiplex our stream into two channels: one is piped through a plain <code>uniq<\/code> and compressed with <code>pigz<\/code> (like <code>gzip<\/code> but parallel, as gzipping &gt; 80 GB otherwise becomes quite the bottleneck here) into <code>dbpedia_uniq.nt.gz<\/code>; the other goes through <code>uniq -c -d<\/code>, which prints only the duplicated lines together with their counts, and is <code>gzip<\/code>ped (single-threaded is OK here, as it&#8217;s not sooo big) into <code>dbpedia_dups.txt.gz<\/code>.<\/p>\n<pre><code class=\"bash\">pv \/usr\/local\/data\/datasets\/remote\/dbpedia\/2014\/importedGraphs\/{dbpedia.org,ext.dbpedia.org,pagelinks.dbpedia.org,topicalconcepts.dbpedia.org}\/* |\r\n  zcat |  # decompress\r\n  grep -v -E '^\\s*#' |  # ignore comments in the nt files\r\n  sort -S32G -T\/ssd\/tmp\/ --compress-program=lzop |\r\n  tee \\\r\n    &gt;( uniq | pigz &gt; dbpedia_uniq.nt.gz ) \\\r\n    &gt;( uniq -c -d | gzip &gt; dbpedia_dups.txt.gz ) \\\r\n    &gt;\/dev\/null\r\n<\/code><\/pre>\n<p>As you can see from the first line, I include the external, pagelinks and topicalconcepts datasets, but the process is the same no matter which datasets you use.<\/p>\n<p>After ~ 10 minutes we&#8217;re left with a 6.5 GB <a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_uniq.nt.gz\" target=\"_blank\">dbpedia_uniq.nt.gz<\/a> (547,084,682 unique triples) and a 238 MB <a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_dups.txt.gz\" target=\"_blank\">dbpedia_dups.txt.gz<\/a>.<\/p>\n<h2>Top Duplicates<\/h2>\n<p>The top duplicates as acquired with<\/p>\n<pre><code class=\"bash\">zcat dbpedia_dups.txt.gz | sort -n -r -S8G | head -n20\r\n<\/code><\/pre>\n<p>are (<a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_dups_tops.txt.gz\" target=\"_blank\">full file<\/a> (238 MB)):<\/p>\n<pre><code>   4891 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_Slovenia.svg?width=300&gt; 
&lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Flag_of_Slovenia.svg&gt; .\r\n   4891 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_Slovenia.svg&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_Slovenia.svg?width=300&gt; .\r\n   4891 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_Slovenia.svg&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Flag_of_Slovenia.svg&gt; .\r\n   1520 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Naval_Ensign_of_the_United_Kingdom.svg?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Naval_Ensign_of_the_United_Kingdom.svg&gt; .\r\n   1520 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Naval_Ensign_of_the_United_Kingdom.svg&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Naval_Ensign_of_the_United_Kingdom.svg?width=300&gt; .\r\n   1520 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Naval_Ensign_of_the_United_Kingdom.svg&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Naval_Ensign_of_the_United_Kingdom.svg&gt; .\r\n   1195 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Airplane_silhouette.svg?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Airplane_silhouette.svg&gt; .\r\n   1195 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Airplane_silhouette.svg&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Airplane_silhouette.svg?width=300&gt; .\r\n   1195 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Airplane_silhouette.svg&gt; 
&lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Airplane_silhouette.svg&gt; .\r\n   1188 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Med_1.png?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Med_1.png&gt; .\r\n   1188 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Med_1.png&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Med_1.png?width=300&gt; .\r\n   1188 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Med_1.png&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Med_1.png&gt; .\r\n   1159 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_the_British_Army.svg?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Flag_of_the_British_Army.svg&gt; .\r\n   1159 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_the_British_Army.svg&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_the_British_Army.svg?width=300&gt; .\r\n   1159 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Flag_of_the_British_Army.svg&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Flag_of_the_British_Army.svg&gt; .\r\n    914 &lt;http:\/\/en.wikipedia.org\/wiki\/Special:FilePath\/Cricket_no_pic.png?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Cricket_no_pic.png&gt; .\r\n    914 &lt;http:\/\/en.wikipedia.org\/wiki\/Special:FilePath\/Cricket_no_pic.png&gt; &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/Special:FilePath\/Cricket_no_pic.png?width=300&gt; .\r\n    914 
&lt;http:\/\/en.wikipedia.org\/wiki\/Special:FilePath\/Cricket_no_pic.png&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Cricket_no_pic.png&gt; .\r\n    885 &lt;http:\/\/dbpedia.org\/resource\/List_of_Tachinidae_genera&gt; &lt;http:\/\/dbpedia.org\/ontology\/wikiPageWikiLink&gt; &lt;http:\/\/dbpedia.org\/resource\/List_of_Tachinidae_genera&gt; .\r\n    784 &lt;http:\/\/commons.wikimedia.org\/wiki\/Special:FilePath\/Illinois_-_outline_map.svg?width=300&gt; &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt; &lt;http:\/\/en.wikipedia.org\/wiki\/File:Illinois_-_outline_map.svg&gt; .\r\n<\/code><\/pre>\n<h2>Getting S,P,O Counts<\/h2>\n<p>OK, now let&#8217;s count the subject, predicate and object occurrences.<br \/>\nSubjects, predicates and objects are delimited by a single space (&#8220; &#8221;); everything after the second space we simply count as the object (so the final &#8220; .&#8221; is counted as part of the object).<br \/>\nSimilar to the above pipeline, we use <code>tee<\/code> again to multiplex the stream into three pipelines for subject, predicate and object counts.<br \/>\nEach of them is mostly based on <code>cut<\/code>, first to get the fields (<code>-f1<\/code> for subject, <code>-f2<\/code> predicate, <code>-f3-<\/code> object), then to limit very long strings to their first 1024 chars. While this introduces some false positive matches for long literals that share their first 1024 chars, it&#8217;s probably safe for URIs, and it reduces sort times and file sizes for the object chunk a lot. 
If you want exact counts, you should re-run without the <code>cut -c-1024<\/code> lines.<br \/>\nAfterwards, in each pipeline the nodes in the s, p, o position are sorted, counted with <code>uniq -c<\/code>, and then compressed with <code>pigz<\/code>.<\/p>\n<pre><code class=\"bash\">pv dbpedia_uniq.nt.gz |\r\n  zcat |\r\n  tee \\\r\n    &gt;( cut -f1 -d' ' |\r\n       cut -c-1024 |\r\n       sort -S16G -T\/ssd\/tmp\/ --compress-program=lzop |\r\n       uniq -c |\r\n       pigz &gt; dbpedia_1_subject_counts.txt.gz ) \\\r\n    &gt;( cut -f2 -d' ' |\r\n       cut -c-1024 |\r\n       sort -S16G -T\/ssd\/tmp\/ --compress-program=lzop |\r\n       uniq -c |\r\n       pigz &gt; dbpedia_2_predicate_counts.txt.gz ) \\\r\n    &gt;( cut -f3- -d' ' |\r\n       cut -c-1024 |\r\n       sort -S16G -T\/ssd\/tmp\/ --compress-program=lzop |\r\n       uniq -c |\r\n       pigz &gt; dbpedia_3_object_counts.txt.gz ) \\\r\n    &gt;\/dev\/null\r\n<\/code><\/pre>\n<p>After 15 minutes we&#8217;re left with 3 files:<br \/>\n<a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_1_subject_counts.txt.gz\" target=\"_blank\">dbpedia_1_subject_counts.txt.gz<\/a> (214M), <a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_2_predicate_counts.txt.gz\" target=\"_blank\">dbpedia_2_predicate_counts.txt.gz<\/a> (387K), <a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_3_object_counts.txt.gz\" target=\"_blank\">dbpedia_3_object_counts.txt.gz<\/a> (1.9G)<\/p>\n<p>As expected, there are only relatively few different predicates, and the objects take up quite a lot of data.<\/p>\n<p>Before getting the tops it&#8217;s quite useful to exclude subjects and objects that occur fewer than 10 times with <code>awk<\/code>, which greatly reduces the file sizes and subsequent sort times:<\/p>\n<pre><code class=\"bash\">zcat dbpedia_1_subject_counts.txt.gz | awk ' $1 &gt; 9 { print } ' | pigz &gt; 
dbpedia_1_subject_counts_o9.txt.gz\r\nzcat dbpedia_3_object_counts.txt.gz | awk ' $1 &gt; 9 { print } ' | pigz &gt; dbpedia_3_object_counts_o9.txt.gz\r\n<\/code><\/pre>\n<p><a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_1_subject_counts_o9.txt.gz\" target=\"_blank\">dbpedia_1_subject_counts_o9.txt.gz<\/a> (89M), <a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_3_object_counts_o9.txt.gz\" target=\"_blank\">dbpedia_3_object_counts_o9.txt.gz<\/a> (30M).<\/p>\n<p>As the size reduction already shows, far more objects than subjects occur fewer than 10 times.<\/p>\n<p>Similar to before, the tops can be acquired with something like this:<\/p>\n<pre><code class=\"bash\">for f in dbpedia_1_subject_counts_o9.txt.gz dbpedia_2_predicate_counts.txt.gz dbpedia_3_object_counts_o9.txt.gz ; do\r\n  zcat \"$f\" | sort -n -r | pigz &gt; \"${f%.txt.gz}_tops.txt.gz\"\r\ndone\r\n<\/code><\/pre>\n<p>So here they are, the &#8230;<\/p>\n<h2>Top 100 Subjects:<\/h2>\n<p><a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_1_subject_counts_o9_tops.txt.gz\" target=\"_blank\">dbpedia_1_subject_counts_o9_tops.txt.gz<\/a> (95M)<\/p>\n<pre><code>   8118 &lt;http:\/\/dbpedia.org\/resource\/Alphabetical_list_of_communes_of_Italy&gt;\r\n   7110 &lt;http:\/\/dbpedia.org\/resource\/List_of_places_in_Afghanistan&gt;\r\n   6162 &lt;http:\/\/dbpedia.org\/resource\/Index_of_Andhra_Pradesh-related_articles&gt;\r\n   5857 &lt;http:\/\/dbpedia.org\/resource\/List_of_populated_places_in_Bosnia_and_Herzegovina&gt;\r\n   5712 &lt;http:\/\/dbpedia.org\/resource\/2013_in_film&gt;\r\n   5550 &lt;http:\/\/dbpedia.org\/resource\/List_of_municipalities_of_Brazil&gt;\r\n   5458 &lt;http:\/\/dbpedia.org\/resource\/List_of_dialling_codes_in_Germany&gt;\r\n   5405 &lt;http:\/\/dbpedia.org\/resource\/IUCN_Red_List_vulnerable_species_(Plantae)&gt;\r\n   5392 
&lt;http:\/\/dbpedia.org\/resource\/List_of_CJK_Unified_Ideographs,_part_3_of_4&gt;\r\n   5392 &lt;http:\/\/dbpedia.org\/resource\/List_of_CJK_Unified_Ideographs,_part_2_of_4&gt;\r\n   5392 &lt;http:\/\/dbpedia.org\/resource\/List_of_CJK_Unified_Ideographs,_part_1_of_4&gt;\r\n   5182 &lt;http:\/\/dbpedia.org\/resource\/IUCN_Red_List_vulnerable_species_(Animalia)&gt;\r\n   5152 &lt;http:\/\/dbpedia.org\/resource\/Index_of_India-related_articles&gt;\r\n   5090 &lt;http:\/\/dbpedia.org\/resource\/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States&gt;\r\n   5068 &lt;http:\/\/dbpedia.org\/resource\/List_of_Social_Democratic_Party_of_Germany_members&gt;\r\n   4942 &lt;http:\/\/dbpedia.org\/resource\/List_of_painters_in_the_Web_Gallery_of_Art&gt;\r\n   4873 &lt;http:\/\/dbpedia.org\/resource\/List_of_stage_names&gt;\r\n   4829 &lt;http:\/\/dbpedia.org\/resource\/List_of_CJK_Unified_Ideographs,_part_4_of_4&gt;\r\n   4795 &lt;http:\/\/dbpedia.org\/resource\/List_of_Harvard_University_people&gt;\r\n   4743 &lt;http:\/\/dbpedia.org\/resource\/List_of_OMIM_disorder_codes&gt;\r\n   4726 &lt;http:\/\/dbpedia.org\/resource\/List_of_populated_places_in_Serbia&gt;\r\n   4698 &lt;http:\/\/dbpedia.org\/resource\/List_of_populated_places_in_Serbia_(alphabetic)&gt;\r\n   4690 &lt;http:\/\/dbpedia.org\/resource\/Index_of_philosophy_articles_(I%E2%80%93Q)&gt;\r\n   4603 &lt;http:\/\/dbpedia.org\/resource\/List_of_molluscan_genera_represented_in_the_fossil_record&gt;\r\n   4493 &lt;http:\/\/dbpedia.org\/resource\/List_of_American_television_programs_by_date&gt;\r\n   4457 &lt;http:\/\/dbpedia.org\/resource\/List_of_biographical_films&gt;\r\n   4443 &lt;http:\/\/dbpedia.org\/resource\/List_of_brachiopod_genera&gt;\r\n   4355 &lt;http:\/\/dbpedia.org\/resource\/List_of_English_writers&gt;\r\n   4345 &lt;http:\/\/dbpedia.org\/resource\/List_of_composers_by_name&gt;\r\n   4341 
&lt;http:\/\/dbpedia.org\/resource\/List_of_historical_German_and_Czech_names_for_places_in_the_Czech_Republic&gt;\r\n   4329 &lt;http:\/\/dbpedia.org\/resource\/2012_in_film&gt;\r\n   4275 &lt;http:\/\/dbpedia.org\/resource\/List_of_people_from_Illinois&gt;\r\n   4219 &lt;http:\/\/dbpedia.org\/resource\/List_of_people_from_Texas&gt;\r\n   4218 &lt;http:\/\/dbpedia.org\/resource\/List_of_village_development_committees_of_Nepal&gt;\r\n   4194 &lt;http:\/\/dbpedia.org\/resource\/List_of_postal_codes_in_Portugal&gt;\r\n   4159 &lt;http:\/\/dbpedia.org\/resource\/IUCN_Red_List_data_deficient_species_(Chordata)&gt;\r\n   4140 &lt;http:\/\/dbpedia.org\/resource\/List_of_trilobite_genera&gt;\r\n   4137 &lt;http:\/\/dbpedia.org\/resource\/List_of_aircraft_engines&gt;\r\n   4130 &lt;http:\/\/dbpedia.org\/resource\/List_of_moths_of_Taiwan&gt;\r\n   4084 &lt;http:\/\/dbpedia.org\/resource\/List_of_flora_of_the_Sonoran_Desert_Region_by_common_name&gt;\r\n   3992 &lt;http:\/\/dbpedia.org\/resource\/List_of_film_score_composers&gt;\r\n   3984 &lt;http:\/\/dbpedia.org\/resource\/List_of_marine_gastropod_genera_in_the_fossil_record&gt;\r\n   3930 &lt;http:\/\/dbpedia.org\/resource\/List_of_performances_on_Top_of_the_Pops&gt;\r\n   3886 &lt;http:\/\/dbpedia.org\/resource\/List_of_gliders&gt;\r\n   3873 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Romania&gt;\r\n   3839 &lt;http:\/\/dbpedia.org\/resource\/List_of_20th-century_classical_composers&gt;\r\n   3768 &lt;http:\/\/dbpedia.org\/resource\/Rosters_of_the_top_basketball_teams_in_European_club_competitions&gt;\r\n   3740 &lt;http:\/\/dbpedia.org\/resource\/List_of_airports_by_ICAO_code:_K&gt;\r\n   3705 &lt;http:\/\/dbpedia.org\/resource\/List_of_United_States_counties_and_county_equivalents&gt;\r\n   3659 &lt;http:\/\/dbpedia.org\/resource\/List_of_Russian_people&gt;\r\n   3646 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Germany&gt;\r\n   3615 
&lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Switzerland&gt;\r\n   3597 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Slovakia&gt;\r\n   3589 &lt;http:\/\/dbpedia.org\/resource\/List_of_protected_areas_of_China&gt;\r\n   3583 &lt;http:\/\/dbpedia.org\/resource\/List_of_Advanced_Dungeons_&amp;_Dragons_2nd_edition_monsters&gt;\r\n   3541 &lt;http:\/\/dbpedia.org\/resource\/Index_of_U.S._counties&gt;\r\n   3502 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Hungary&gt;\r\n   3499 &lt;http:\/\/dbpedia.org\/resource\/The_opera_corpus&gt;\r\n   3466 &lt;http:\/\/dbpedia.org\/resource\/List_of_German_Christian_Democratic_Union_politicians&gt;\r\n   3466 &lt;http:\/\/dbpedia.org\/resource\/2012%E2%80%9313_UEFA_Europa_League_qualifying_phase_and_play-off_round&gt;\r\n   3439 &lt;http:\/\/dbpedia.org\/resource\/List_of_viruses&gt;\r\n   3439 &lt;http:\/\/dbpedia.org\/resource\/Google_Street_View_in_the_United_States&gt;\r\n   3432 &lt;http:\/\/dbpedia.org\/resource\/2013%E2%80%9314_UEFA_Europa_League_qualifying_phase_and_play-off_round&gt;\r\n   3430 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_the_Czech_Republic&gt;\r\n   3392 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Greece&gt;\r\n   3378 &lt;http:\/\/dbpedia.org\/resource\/List_of_surnames_in_Russia&gt;\r\n   3378 &lt;http:\/\/dbpedia.org\/resource\/List_of_film_director_and_actor_collaborations&gt;\r\n   3377 &lt;http:\/\/dbpedia.org\/resource\/2010%E2%80%9311_UEFA_Europa_League_qualifying_phase_and_play-off_round&gt;\r\n   3342 &lt;http:\/\/dbpedia.org\/resource\/2009%E2%80%9310_UEFA_Europa_League_qualifying_phase_and_play-off_round&gt;\r\n   3327 &lt;http:\/\/dbpedia.org\/resource\/Index_of_World_War_II_articles_(U)&gt;\r\n   3321 &lt;http:\/\/dbpedia.org\/resource\/List_of_moths_of_Madagascar&gt;\r\n   3295 &lt;http:\/\/dbpedia.org\/resource\/List_of_country_houses_in_the_United_Kingdom&gt;\r\n   3277 
&lt;http:\/\/dbpedia.org\/resource\/List_of_counties_by_U.S._state&gt;\r\n   3273 &lt;http:\/\/dbpedia.org\/resource\/List_of_licensed_and_localized_editions_of_Monopoly:_Europe&gt;\r\n   3255 &lt;http:\/\/dbpedia.org\/resource\/List_of_moths_of_North_America_(MONA_8322-11233)&gt;\r\n   3254 &lt;http:\/\/dbpedia.org\/resource\/List_of_local_administrative_units_of_Romania&gt;\r\n   3236 &lt;http:\/\/dbpedia.org\/resource\/Catalog_of_paintings_in_the_National_Gallery,_London&gt;\r\n   3233 &lt;http:\/\/dbpedia.org\/resource\/IUCN_Red_List_endangered_species_(Animalia)&gt;\r\n   3232 &lt;http:\/\/dbpedia.org\/resource\/IUCN_Red_List_near_threatened_species_(Animalia)&gt;\r\n   3209 &lt;http:\/\/dbpedia.org\/resource\/List_of_Chopped_episodes&gt;\r\n   3201 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Poland&gt;\r\n   3200 &lt;http:\/\/dbpedia.org\/resource\/List_of_directorial_debuts&gt;\r\n   3192 &lt;http:\/\/dbpedia.org\/resource\/List_of_postal_codes_in_Germany&gt;\r\n   3175 &lt;http:\/\/dbpedia.org\/resource\/2010_in_film&gt;\r\n   3163 &lt;http:\/\/dbpedia.org\/resource\/Index_of_philosophy_articles_(R%E2%80%93Z)&gt;\r\n   3156 &lt;http:\/\/dbpedia.org\/resource\/List_of_bannered_U.S._Routes&gt;\r\n   3136 &lt;http:\/\/dbpedia.org\/resource\/Timeline_of_Google_Street_View&gt;\r\n   3135 &lt;http:\/\/dbpedia.org\/resource\/Index_of_Byzantine_Empire-related_articles&gt;\r\n   3129 &lt;http:\/\/dbpedia.org\/resource\/Index_of_Singapore-related_articles&gt;\r\n   3114 &lt;http:\/\/dbpedia.org\/resource\/List_of_postal_codes_of_Switzerland&gt;\r\n   3107 &lt;http:\/\/dbpedia.org\/resource\/2009_in_film&gt;\r\n   3088 &lt;http:\/\/dbpedia.org\/resource\/List_of_University_of_Pennsylvania_people&gt;\r\n   3071 &lt;http:\/\/dbpedia.org\/resource\/List_of_children's_television_series_by_country&gt;\r\n   3065 &lt;http:\/\/dbpedia.org\/resource\/List_of_populated_places_in_the_Netherlands&gt;\r\n   3044 
&lt;http:\/\/dbpedia.org\/resource\/List_of_ZX_Spectrum_games&gt;\r\n   3039 &lt;http:\/\/dbpedia.org\/resource\/October_2011_in_sports&gt;\r\n   3022 &lt;http:\/\/dbpedia.org\/resource\/List_of_flora_of_Ohio&gt;\r\n   3019 &lt;http:\/\/dbpedia.org\/resource\/List_of_PlayStation_2_games&gt;\r\n   3016 &lt;http:\/\/dbpedia.org\/resource\/List_of_Lepidoptera_of_Bulgaria&gt;\r\n   3005 &lt;http:\/\/dbpedia.org\/resource\/List_of_voice_actors&gt;\r\n<\/code><\/pre>\n<h3>Observations:<\/h3>\n<p>The top subjects are clearly dominated by list-like resources. Very big &#8220;normal&#8221; articles, such as those of countries like <a href=\"http:\/\/dbpedia.org\/resource\/United_States\" target=\"_blank\">dbpedia:United_States<\/a> (1375 occurrences as subject) or <a href=\"http:\/\/dbpedia.org\/resource\/Germany\" target=\"_blank\">dbpedia:Germany<\/a> (1331 occurrences as subject), only show up below ranks 1518 and 1673, respectively. Scrolling through the top subject counts, the ratio of &#8220;List&#8221; vs. 
non-&#8220;List&#8221; resources slowly seems to equalize around 1000 occurrences (rank 3800+), but even for subjects that &#8220;only&#8221; occur ~500 times (rank 21000+) there seem to be ~1\/4 &#8220;Lists&#8221; still.<\/p>\n<h2>Top 100 Predicates:<\/h2>\n<p><a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_2_predicate_counts_tops.txt.gz\" target=\"_blank\">dbpedia_2_predicate_counts_tops.txt.gz<\/a> (384K)<\/p>\n<pre><code>149707899 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageWikiLink&gt;\r\n86391520 &lt;http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type&gt;\r\n33958849 &lt;http:\/\/www.w3.org\/2002\/07\/owl#sameAs&gt;\r\n18731754 &lt;http:\/\/purl.org\/dc\/terms\/subject&gt;\r\n13926391 &lt;http:\/\/www.w3.org\/2000\/01\/rdf-schema#label&gt;\r\n13494896 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageRevisionID&gt;\r\n13494875 &lt;http:\/\/www.w3.org\/ns\/prov#wasDerivedFrom&gt;\r\n13494819 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageID&gt;\r\n10948106 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageOutDegree&gt;\r\n10948106 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageLength&gt;\r\n10948086 &lt;http:\/\/xmlns.com\/foaf\/0.1\/primaryTopic&gt;\r\n10948086 &lt;http:\/\/xmlns.com\/foaf\/0.1\/isPrimaryTopicOf&gt;\r\n10948086 &lt;http:\/\/purl.org\/dc\/elements\/1.1\/language&gt;\r\n7081593 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageExternalLink&gt;\r\n6473988 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageRedirects&gt;\r\n5926272 &lt;http:\/\/dbpedia.org\/ontology\/abstract&gt;\r\n5925778 &lt;http:\/\/www.w3.org\/2000\/01\/rdf-schema#comment&gt;\r\n4267352 &lt;http:\/\/xmlns.com\/foaf\/0.1\/name&gt;\r\n4041585 &lt;http:\/\/dbpedia.org\/property\/hasPhotoCollection&gt;\r\n3781737 &lt;http:\/\/dbpedia.org\/property\/name&gt;\r\n2342002 &lt;http:\/\/purl.org\/dc\/elements\/1.1\/rights&gt;\r\n2268299 &lt;http:\/\/www.w3.org\/2004\/02\/skos\/core#broader&gt;\r\n2084717 &lt;http:\/\/purl.org\/dc\/elements\/1.1\/description&gt;\r\n1514496 
&lt;http:\/\/dbpedia.org\/ontology\/team&gt;\r\n1374565 &lt;http:\/\/xmlns.com\/foaf\/0.1\/depiction&gt;\r\n1374185 &lt;http:\/\/dbpedia.org\/ontology\/thumbnail&gt;\r\n1363398 &lt;http:\/\/dbpedia.org\/ontology\/wikiPageDisambiguates&gt;\r\n1289141 &lt;http:\/\/dbpedia.org\/property\/title&gt;\r\n1231780 &lt;http:\/\/dbpedia.org\/property\/subdivisionType&gt;\r\n1171004 &lt;http:\/\/xmlns.com\/foaf\/0.1\/thumbnail&gt;\r\n1122598 &lt;http:\/\/www.w3.org\/2004\/02\/skos\/core#prefLabel&gt;\r\n1080114 &lt;http:\/\/xmlns.com\/foaf\/0.1\/givenName&gt;\r\n1058532 &lt;http:\/\/www.georss.org\/georss\/point&gt;\r\n1052578 &lt;http:\/\/dbpedia.org\/property\/shortDescription&gt;\r\n1052115 &lt;http:\/\/xmlns.com\/foaf\/0.1\/surname&gt;\r\n1005079 &lt;http:\/\/dbpedia.org\/ontology\/birthPlace&gt;\r\n 995639 &lt;http:\/\/dbpedia.org\/ontology\/birthDate&gt;\r\n 983813 &lt;http:\/\/dbpedia.org\/property\/subdivisionName&gt;\r\n 973597 &lt;http:\/\/dbpedia.org\/ontology\/birthYear&gt;\r\n 968085 &lt;http:\/\/dbpedia.org\/property\/dateOfBirth&gt;\r\n 907869 &lt;http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#lat&gt;\r\n 906919 &lt;http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#long&gt;\r\n 861765 &lt;http:\/\/dbpedia.org\/property\/goals&gt;\r\n 846283 &lt;http:\/\/dbpedia.org\/property\/placeOfBirth&gt;\r\n 846182 &lt;http:\/\/dbpedia.org\/ontology\/isPartOf&gt;\r\n 838381 &lt;http:\/\/dbpedia.org\/property\/birthPlace&gt;\r\n 826348 &lt;http:\/\/dbpedia.org\/property\/years&gt;\r\n 656559 &lt;http:\/\/dbpedia.org\/property\/length&gt;\r\n 653929 &lt;http:\/\/dbpedia.org\/property\/date&gt;\r\n 649375 &lt;http:\/\/xmlns.com\/foaf\/0.1\/homepage&gt;\r\n 643162 &lt;http:\/\/dbpedia.org\/ontology\/careerStation&gt;\r\n 641528 &lt;http:\/\/dbpedia.org\/ontology\/years&gt;\r\n 574296 &lt;http:\/\/dbpedia.org\/property\/birthDate&gt;\r\n 556627 &lt;http:\/\/dbpedia.org\/property\/genre&gt;\r\n 553122 &lt;http:\/\/dbpedia.org\/ontology\/country&gt;\r\n 539366 
&lt;http:\/\/dbpedia.org\/property\/clubs&gt;\r\n 529649 &lt;http:\/\/dbpedia.org\/property\/location&gt;\r\n 525787 &lt;http:\/\/dbpedia.org\/property\/rd1Team&gt;\r\n 512507 &lt;http:\/\/dbpedia.org\/ontology\/numberOfGoals&gt;\r\n 501875 &lt;http:\/\/dbpedia.org\/ontology\/genre&gt;\r\n 492028 &lt;http:\/\/dbpedia.org\/ontology\/numberOfMatches&gt;\r\n 453911 &lt;http:\/\/dbpedia.org\/ontology\/deathDate&gt;\r\n 449759 &lt;http:\/\/dbpedia.org\/ontology\/deathYear&gt;\r\n 448696 &lt;http:\/\/dbpedia.org\/property\/dateOfDeath&gt;\r\n 448362 &lt;http:\/\/www.w3.org\/2002\/07\/owl#equivalentClass&gt;\r\n 446799 &lt;http:\/\/dbpedia.org\/property\/caption&gt;\r\n 446238 &lt;http:\/\/www.w3.org\/2000\/01\/rdf-schema#subClassOf&gt;\r\n 440121 &lt;http:\/\/dbpedia.org\/property\/votes&gt;\r\n 437797 &lt;http:\/\/dbpedia.org\/property\/wordnet_type&gt;\r\n 435648 &lt;http:\/\/dbpedia.org\/property\/type&gt;\r\n 431262 &lt;http:\/\/dbpedia.org\/property\/caps&gt;\r\n 418391 &lt;http:\/\/dbpedia.org\/ontology\/utcOffset&gt;\r\n 362378 &lt;http:\/\/dbpedia.org\/property\/percentage&gt;\r\n 362327 &lt;http:\/\/dbpedia.org\/ontology\/type&gt;\r\n 355814 &lt;http:\/\/dbpedia.org\/property\/country&gt;\r\n 346584 &lt;http:\/\/dbpedia.org\/property\/candidate&gt;\r\n 340788 &lt;http:\/\/dbpedia.org\/property\/starring&gt;\r\n 338307 &lt;http:\/\/dbpedia.org\/ontology\/location&gt;\r\n 327879 &lt;http:\/\/dbpedia.org\/ontology\/family&gt;\r\n 326730 &lt;http:\/\/dbpedia.org\/property\/longew&gt;\r\n 326699 &lt;http:\/\/dbpedia.org\/property\/latns&gt;\r\n 315041 &lt;http:\/\/dbpedia.org\/property\/writer&gt;\r\n 314566 &lt;http:\/\/dbpedia.org\/ontology\/starring&gt;\r\n 312958 &lt;http:\/\/dbpedia.org\/property\/label&gt;\r\n 310020 &lt;http:\/\/dbpedia.org\/property\/rd2Team&gt;\r\n 306816 &lt;http:\/\/dbpedia.org\/property\/settlementType&gt;\r\n 306271 &lt;http:\/\/dbpedia.org\/property\/longd&gt;\r\n 306246 &lt;http:\/\/dbpedia.org\/property\/latd&gt;\r\n 306195 
&lt;http:\/\/dbpedia.org\/ontology\/populationTotal&gt;\r\n 293237 &lt;http:\/\/dbpedia.org\/property\/team&gt;\r\n 283282 &lt;http:\/\/dbpedia.org\/property\/producer&gt;\r\n 279814 &lt;http:\/\/dbpedia.org\/ontology\/occupation&gt;\r\n 278108 &lt;http:\/\/dbpedia.org\/ontology\/order&gt;\r\n 277006 &lt;http:\/\/dbpedia.org\/property\/episodenumber&gt;\r\n 275430 &lt;http:\/\/dbpedia.org\/property\/longm&gt;\r\n 275416 &lt;http:\/\/dbpedia.org\/property\/latm&gt;\r\n 272562 &lt;http:\/\/dbpedia.org\/ontology\/deathPlace&gt;\r\n 267256 &lt;http:\/\/dbpedia.org\/ontology\/class&gt;\r\n 265454 &lt;http:\/\/dbpedia.org\/property\/timezone&gt;\r\n 264081 &lt;http:\/\/dbpedia.org\/ontology\/viafId&gt;\r\n<\/code><\/pre>\n<h3>Observations:<\/h3>\n<p>The predicates are clearly dominated by <a href=\"http:\/\/dbpedia.org\/ontology\/wikiPageWikiLink\" target=\"_blank\">dbpedia-owl:wikiPageWikiLink<\/a> and <a href=\"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type\" target=\"_blank\">rdf:type<\/a> relations.<br \/>\nWhat&#8217;s a bit surprising to me is that <a href=\"http:\/\/purl.org\/dc\/terms\/subject\" target=\"_blank\">dcterms:subject<\/a> occurs less often than <a href=\"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type\" target=\"_blank\">rdf:type<\/a>, but my guess is that this is due to the YAGO types and to hierarchy materialization (an Athlete is also typed as a Person). There&#8217;s a slight mismatch between the counts of <a href=\"http:\/\/dbpedia.org\/ontology\/wikiPageRevisionID\" target=\"_blank\">dbpedia-owl:wikiPageRevisionID<\/a> (13,494,896) and <a href=\"http:\/\/www.w3.org\/ns\/prov#wasDerivedFrom\" target=\"_blank\">prov:wasDerivedFrom<\/a> (13,494,875). 
There are more <a href=\"http:\/\/dbpedia.org\/ontology\/abstract\" target=\"_blank\">dbpedia-ontology:abstract<\/a>s than <a href=\"http:\/\/www.w3.org\/2000\/01\/rdf-schema#comment\" target=\"_blank\">rdfs:comment<\/a>s and more <a href=\"http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#lat\" target=\"_blank\">geo:lat<\/a>s than <a href=\"http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#long\" target=\"_blank\">geo:long<\/a>s.<\/p>\n<h2>Top 100 Objects:<\/h2>\n<p><a href=\"http:\/\/projects.dfki.uni-kl.de\/~hees\/dbpedia\/2014\/stats\/dbpedia_3_object_counts_o9_tops.txt.gz\" target=\"_blank\">dbpedia_3_object_counts_o9_tops.txt.gz<\/a> (33M)<\/p>\n<pre><code>10948086 &lt;http:\/\/xmlns.com\/foaf\/0.1\/Document&gt; .\r\n10948086 \"en\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#string&gt; .\r\n6239553 \"1\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n2250659 &lt;http:\/\/dbpedia.org\/class\/yago\/PhysicalEntity100001930&gt; .\r\n2169386 &lt;http:\/\/dbpedia.org\/class\/yago\/Object100002684&gt; .\r\n2155200 &lt;http:\/\/www.w3.org\/2002\/07\/owl#Thing&gt; .\r\n1974654 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#Agent&gt; .\r\n1974654 &lt;http:\/\/dbpedia.org\/ontology\/Agent&gt; .\r\n1816213 &lt;http:\/\/dbpedia.org\/class\/yago\/YagoLegalActorGeo&gt; .\r\n1650316 &lt;http:\/\/xmlns.com\/foaf\/0.1\/Person&gt; .\r\n1649647 &lt;http:\/\/wikidata.dbpedia.org\/resource\/Q5&gt; .\r\n1649647 &lt;http:\/\/wikidata.dbpedia.org\/resource\/Q215627&gt; .\r\n1649647 &lt;http:\/\/schema.org\/Person&gt; .\r\n1649646 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#NaturalPerson&gt; .\r\n1649646 &lt;http:\/\/dbpedia.org\/ontology\/Person&gt; .\r\n1621660 &lt;http:\/\/dbpedia.org\/class\/yago\/Whole100003553&gt; .\r\n1318799 &lt;http:\/\/dbpedia.org\/resource\/Category:Living_people&gt; .\r\n1290718 &lt;http:\/\/dbpedia.org\/class\/yago\/YagoLegalActor&gt; .\r\n1257968 
&lt;http:\/\/dbpedia.org\/class\/yago\/YagoPermanentlyLocatedEntity&gt; .\r\n1192248 &lt;http:\/\/www.w3.org\/2004\/02\/skos\/core#Concept&gt; .\r\n1090313 &lt;http:\/\/dbpedia.org\/class\/yago\/LivingThing100004258&gt; .\r\n1090140 &lt;http:\/\/dbpedia.org\/class\/yago\/Organism100004475&gt; .\r\n1046726 &lt;http:\/\/dbpedia.org\/class\/yago\/Person100007846&gt; .\r\n1020287 &lt;http:\/\/dbpedia.org\/class\/yago\/CausalAgent100007347&gt; .\r\n 868376 &lt;http:\/\/dbpedia.org\/resource\/United_States&gt; .\r\n 816854 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/d0.owl#Location&gt; .\r\n 816837 &lt;http:\/\/schema.org\/Place&gt; .\r\n 816837 &lt;http:\/\/dbpedia.org\/ontology\/Wikidata:Q532&gt; .\r\n 816837 &lt;http:\/\/dbpedia.org\/ontology\/Place&gt; .\r\n 814269 &lt;http:\/\/dbpedia.org\/class\/yago\/YagoGeoEntity&gt; .\r\n 726965 &lt;http:\/\/dbpedia.org\/class\/yago\/Abstraction100002137&gt; .\r\n 658562 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#Situation&gt; .\r\n 643162 &lt;http:\/\/dbpedia.org\/ontology\/CareerStation&gt; .\r\n 561841 &lt;http:\/\/dbpedia.org\/class\/yago\/LivingPeople&gt; .\r\n 547827 &lt;http:\/\/dbpedia.org\/ontology\/PopulatedPlace&gt; .\r\n 547037 \"0\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 539993 \"1\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 531929 &lt;http:\/\/www.opengis.net\/gml\/_Feature&gt; .\r\n 528794 &lt;http:\/\/dbpedia.org\/class\/yago\/Artifact100021939&gt; .\r\n 526256 &lt;http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#SpatialThing&gt; .\r\n 524742 &lt;http:\/\/dbpedia.org\/class\/yago\/Location100027167&gt; .\r\n 505425 &lt;http:\/\/dbpedia.org\/class\/yago\/Region108630985&gt; .\r\n 476724 \"2\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 469006 &lt;http:\/\/dbpedia.org\/ontology\/Settlement&gt; .\r\n 438713 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#InformationEntity&gt; .\r\n 425044 
&lt;http:\/\/schema.org\/CreativeWork&gt; .\r\n 425044 &lt;http:\/\/dbpedia.org\/ontology\/Work&gt; .\r\n 419234 &lt;http:\/\/dbpedia.org\/resource\/List_of_sovereign_states&gt; .\r\n 401287 \"N\"@en .\r\n 400317 &lt;http:\/\/dbpedia.org\/resource\/Animal&gt; .\r\n 377252 \"3\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 358891 &lt;http:\/\/dbpedia.org\/class\/yago\/GeographicalArea108574314&gt; .\r\n 350209 &lt;http:\/\/dbpedia.org\/class\/yago\/District108552138&gt; .\r\n 347718 &lt;http:\/\/dbpedia.org\/class\/yago\/Group100031264&gt; .\r\n 336091 &lt;http:\/\/dbpedia.org\/ontology\/Athlete&gt; .\r\n 335320 \"4\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 321614 &lt;http:\/\/dbpedia.org\/class\/yago\/AdministrativeDistrict108491826&gt; .\r\n 313062 &lt;http:\/\/dbpedia.org\/class\/yago\/SocialGroup107950920&gt; .\r\n 302658 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#SocialPerson&gt; .\r\n 302658 &lt;http:\/\/schema.org\/Organization&gt; .\r\n 302658 &lt;http:\/\/dbpedia.org\/ontology\/Organisation&gt; .\r\n 292395 \"28\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 288074 \"29\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 287957 \"27\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 286256 \"5\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 283702 &lt;http:\/\/dbpedia.org\/class\/yago\/Organization108008335&gt; .\r\n 279633 \"yes\"@en .\r\n 279134 &lt;http:\/\/dbpedia.org\/ontology\/SportsTeamMember&gt; .\r\n 279134 &lt;http:\/\/dbpedia.org\/ontology\/OrganisationMember&gt; .\r\n 277439 \"30\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 277025 \"26\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 276000 \"6\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 264578 \"31\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 263773 
&lt;http:\/\/dbpedia.org\/class\/yago\/Contestant109613191&gt; .\r\n 261435 &lt;http:\/\/www.ontologydesignpatterns.org\/ont\/dul\/DUL.owl#Organism&gt; .\r\n 261435 &lt;http:\/\/dbpedia.org\/ontology\/Species&gt; .\r\n 260007 \"E\"@en .\r\n 256474 &lt;http:\/\/dbpedia.org\/ontology\/Eukaryote&gt; .\r\n 255993 \"25\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 252172 &lt;http:\/\/dbpedia.org\/resource\/England&gt; .\r\n 249675 \"32\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 240706 &lt;http:\/\/dbpedia.org\/resource\/Iran_Standard_Time&gt; .\r\n 234043 \"24\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 231236 \"33\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 228539 \"7\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 221419 \"0\".\r\n 219180 &lt;http:\/\/dbpedia.org\/class\/yago\/Player110439851&gt; .\r\n 218573 \"23\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 218297 &lt;http:\/\/dbpedia.org\/class\/yago\/PsychologicalFeature100023100&gt; .\r\n 217297 \"34\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 215320 &lt;http:\/\/dbpedia.org\/class\/yago\/Athlete109820263&gt; .\r\n 214359 \"18\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 208654 &lt;http:\/\/dbpedia.org\/class\/yago\/Tract108673395&gt; .\r\n 208383 &lt;http:\/\/dbpedia.org\/resource\/Arthropod&gt; .\r\n 206076 \"22\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 205888 \"8\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#integer&gt; .\r\n 204693 &lt;http:\/\/dbpedia.org\/resource\/Lepidoptera&gt; .\r\n 204220 \"21\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; .\r\n 203472 &lt;http:\/\/dbpedia.org\/class\/yago\/Instrumentality103575240&gt; .\r\n 202270 \"35\"^^&lt;http:\/\/www.w3.org\/2001\/XMLSchema#nonNegativeInteger&gt; 
.\r\n<\/code><\/pre>\n<h3>Observations:<\/h3>\n<p>The top of the object counts is dominated by <a href=\"http:\/\/xmlns.com\/foaf\/0.1\/Document\" target=\"_blank\">foaf:Document<\/a> and &#8220;en&#8221;, each with about 10.9 million occurrences, well ahead of everything else. The non-negative &#8220;1&#8221; follows, an order of magnitude ahead of the normal &#8220;0&#8221; and &#8220;1&#8221; \ud83d\ude09 In between, a lot of very useful types follow, and we can see that we have a lot of information about physical things, people, concepts and places. It&#8217;s also nice to see <a href=\"http:\/\/wikidata.dbpedia.org\/resource\/Q5\" target=\"_blank\">http:\/\/wikidata.dbpedia.org\/resource\/Q5<\/a> right under <a href=\"http:\/\/xmlns.com\/foaf\/0.1\/Person\" target=\"_blank\">foaf:Person<\/a>, even though the URI doesn&#8217;t resolve anymore(?) \ud83d\ude41<\/p>\n<p>The first &#8220;real&#8221; &#8220;A-Box&#8221; resource is <a href=\"http:\/\/dbpedia.org\/resource\/United_States\" target=\"_blank\">dbpedia:United_States<\/a>, followed by <a href=\"http:\/\/dbpedia.org\/resource\/Animal\" target=\"_blank\">dbpedia:Animal<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/England\" target=\"_blank\">dbpedia:England<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/Iran_Standard_Time\" target=\"_blank\">dbpedia:Iran_Standard_Time<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/Arthropod\" target=\"_blank\">dbpedia:Arthropod<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/Lepidoptera\" target=\"_blank\">dbpedia:Lepidoptera<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/Canada\" target=\"_blank\">dbpedia:Canada<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/Insect\" target=\"_blank\">dbpedia:Insect<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/France\" target=\"_blank\">dbpedia:France<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/United_Kingdom\" target=\"_blank\">dbpedia:United_Kingdom<\/a>, <a href=\"http:\/\/dbpedia.org\/resource\/India\" target=\"_blank\">dbpedia:India<\/a>, <a 
href=\"http:\/\/dbpedia.org\/resource\/Germany\" target=\"_blank\">dbpedia:Germany<\/a>. In general it seems that, apart from ontology types, many instances of the types country, biological genus and city occur very often as objects.<br \/>\nThe top literals seem to be numbers, especially years and single letters.<\/p>\n<h2>Conclusion<\/h2>\n<p>We&#8217;ve seen that it&#8217;s sadly not possible to get basic top-degree-counts for big datasets via SPARQL, as the endpoints don&#8217;t seem to be optimized for these kinds of queries. I hope this changes in the future, as it&#8217;s quite useful to know degree distributions for all kinds of queries. Especially in the machine learning sector it seems quite essential to know if you&#8217;re dealing with a &#8220;normal&#8221; node or one of the exceptional top nodes that is several orders of magnitude bigger than the rest.<\/p>\n<p>Hope you enjoyed. Feedback welcome, as always.<\/p>\n<h2>Further reading<\/h2>\n<p>Thanks for <a href=\"http:\/\/sourceforge.net\/p\/dbpedia\/mailman\/message\/33284348\/\" target=\"_blank\">all the feedback<\/a> I got on this post. There are somewhat similar works that you might be interested in:<\/p>\n<ul>\n<li><a href=\"http:\/\/s16a.org\/node\/6\" target=\"_blank\">Dinesh Reddy, Magnus Knuth, Harald Sack: DBpedia GraphMeasures<\/a>, based on the wikiPageWikiLink property, computes PageRank, HITS, Inlink and Outlink degree for each Wikipedia Article. 
(These datasets are actually loaded on <a href=\"http:\/\/dbpedia.org\/sparql\" target=\"_blank\">http:\/\/dbpedia.org\/sparql<\/a>, just not in the default graph http:\/\/dbpedia.org.)<\/li>\n<li><a href=\"http:\/\/dbtrends.aksw.org\" target=\"_blank\">http:\/\/dbtrends.aksw.org\/<\/a> calculates some stats similar to this post (and some more), but atm sadly only for DBpedia 3.9 and without stats about Literals and http:\/\/dbpedia.org\/resource\/Category:* resources.<\/li>\n<\/ul>\n<h2>Updates:<\/h2>\n<ul>\n<li>2015-02-01: Further reading<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Ever wondered what the top subjects \/ predicates \/ objects are in DBpedia? I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD. Turns out this is actually more difficult than i expected. Mostly due to the fact that [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[2,198],"tags":[12,206,34,209,207,201,88,202,205,204,199,155,200,203,208],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pYA5n-an","jetpack-related-posts":[{"id":610,"url":"https:\/\/joernhees.de\/blog\/2014\/11\/10\/setting-up-a-local-dbpedia-2014-mirror-with-virtuoso-7-1-0\/","url_meta":{"origin":643,"position":0},"title":"Setting up a local DBpedia 2014 mirror with Virtuoso 7.1.0","date":"2014-11-10","format":false,"excerpt":"Newer version available: Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, ...) with Virtuso 7.2.1 and Docker (optional) So you're the guy who is allowed to setup a local DBpedia mirror or more generally a local Linked Data mirror for your work group? 
OK,\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":731,"url":"https:\/\/joernhees.de\/blog\/2015\/11\/23\/setting-up-a-linked-data-mirror-from-rdf-dumps-dbpedia-2015-04-freebase-wikidata-linkedgeodata-with-virtuoso-7-2-1-and-docker-optional\/","url_meta":{"origin":643,"position":1},"title":"Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, ...) with Virtuoso 7.2.1 and Docker (optional)","date":"2015-11-23","format":false,"excerpt":"So you're the guy who is allowed to setup a local DBpedia mirror or more generally a local Linked Data mirror for your work group? OK, today is your lucky day and you're in the right place. I hope you'll be able to benefit from my many hours of trials\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":277,"url":"https:\/\/joernhees.de\/blog\/2010\/10\/31\/setting-up-a-local-dbpedia-mirror-with-virtuoso\/","url_meta":{"origin":643,"position":2},"title":"Setting up a local DBpedia mirror with Virtuoso","date":"2010-10-31","format":false,"excerpt":"So you're the guy who is allowed to setup a local DBpedia mirror for your work group? OK, today is your lucky day and you're in the right place. I hope you'll be able to benefit from my hours of trials and errors ;)","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":584,"url":"https:\/\/joernhees.de\/blog\/2014\/04\/23\/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7\/","url_meta":{"origin":643,"position":3},"title":"Setting up a local DBpedia 3.9 mirror with Virtuoso 7","date":"2014-04-23","format":false,"excerpt":"Newer version available: Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, ...) 
with Virtuso 7.2.1 and Docker (optional) I just found this aged post in my drafts folder, maybe someone will still like it... So you're the guy who is allowed to setup a\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":442,"url":"https:\/\/joernhees.de\/blog\/2012\/05\/25\/setting-up-a-local-dbpedia-3-7-mirror-with-virtuoso-6-1-5\/","url_meta":{"origin":643,"position":4},"title":"Setting up a local DBpedia 3.7 mirror with Virtuoso 6.1.5+","date":"2012-05-25","format":false,"excerpt":"Newer version available: Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, ...) with Virtuso 7.2.1 and Docker (optional) Nearly 1.5 years after i initially published a post about how to setup a local DBpedia mirror i recently revisited the problem myself to setup a\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":454,"url":"https:\/\/joernhees.de\/blog\/2012\/05\/30\/ted-hans-rosling-religions-and-babies\/","url_meta":{"origin":643,"position":5},"title":"TED: Hans Rosling - Religions and Babies","date":"2012-05-30","format":false,"excerpt":"Heh, Hans Rosling did it again... fascinating stats and explaining world's population growth (the big fill-up) with some leftover boxes. 
http:\/\/www.ted.com\/talks\/hans_rosling_religions_and_babies.html","rel":"","context":"In \"open data\"","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/643"}],"collection":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/comments?post=643"}],"version-history":[{"count":26,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/643\/revisions"}],"predecessor-version":[{"id":665,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/643\/revisions\/665"}],"wp:attachment":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/media?parent=643"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/categories?post=643"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/tags?post=643"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}