Discussion:
Wikidata HDT dump
(too old to reply)
Laura Morales
2017-10-27 15:08:56 UTC
Permalink
Hello everyone,

I'd like to ask if Wikidata could please offer an HDT [1] dump alongside the already available Turtle dump [2]. HDT is a binary format for storing RDF data, which is pretty useful because it can be queried from the command line, it can be used as a Jena/Fuseki source, and it uses orders of magnitude less space to store the same data. The problem is that it's very impractical to generate an HDT file, because the current implementation requires a lot of RAM to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you have one to share, I can help set up the rdf2hdt software required to convert the Wikidata Turtle dump to HDT.

Thank you.

[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/
Jasper Koehorst
2017-10-27 15:11:47 UTC
Permalink
Would it be an idea, if HDT remains unfeasible, to place the Blazegraph journal file online?
Yes, people would need to use Blazegraph if they want to access and query the data, but it could be an extra option next to the Turtle dump?
Post by Laura Morales
Hello everyone,
I'd like to ask if Wikidata could please offer a HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format to store RDF data, which is pretty useful because it can be queried from command line, it can be used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space to store the same data. The problem is that it's very impractical to generate a HDT, because the current implementation requires a lot of RAM processing to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you guys have one to share, I can help setup the rdf2hdt software required to convert Wikidata Turtle to HDT.
Thank you.
[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/
_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
Laura Morales
2017-10-27 15:17:56 UTC
Permalink
Post by Jasper Koehorst
Would it be an idea if HDT remains unfeasible to place the journal file of blazegraph online?
Yes, people need to use blazegraph if they want to access the files and query it but it could be an extra next to turtle dump?
How would a Blazegraph journal file be better than a Turtle dump? Maybe it's smaller in size? Simpler to use?
Jasper Koehorst
2017-10-27 16:10:49 UTC
Permalink
You can mount the .jnl file directly in Blazegraph, so loading and indexing are no longer needed.

From: Laura Morales
Sent: Friday, 27 October 2017 17:18
To: ***@lists.wikimedia.org
CC: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Wikidata HDT dump
Post by Jasper Koehorst
Would it be an idea if HDT remains unfeasible to place the journal file of blazegraph online?
Yes, people need to use blazegraph if they want to access the files and query it but it could be an extra next to turtle dump?
How would a blazegraph journal file be better than a Turtle dump? Maybe it's smaller in size? Simpler to use?
Laura Morales
2017-10-27 16:13:58 UTC
Permalink
Post by Jasper Koehorst
You can mount te jnl file directly to blazegraph so loading and indexing is not needed anymore.
How much larger would this be compared to the Turtle file?
Luigi Assom
2017-10-27 16:51:13 UTC
Permalink
Laura, Wouter, thank you
I did not know about HDT

I found and share this resource:
http://www.rdfhdt.org/datasets/

there is also a Wikidata dump in HDT

I am new to it:
is it possible to store a weighted adjacency matrix as an HDT instead of an
RDF?

Something like a list of entities for each entity, or even better a list of
tuples for each entity, so that a tuple could be generalised with properties.

Here is an example with one property, 'weight', where an entity 'x1' is
associated with a list of other entities, including itself.
x1 = [(w1, x1) ... (w1, xn)]
Post by Laura Morales
Post by Jasper Koehorst
You can mount te jnl file directly to blazegraph so loading and indexing
is not needed anymore.
How much larger would this be compared to the Turtle file?
Jérémie Roquet
2017-10-27 16:56:36 UTC
Permalink
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
--
Jérémie
Jérémie Roquet
2017-10-27 16:58:34 UTC
Permalink
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
… but there's a file on the server:
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (i.e.
the link was missing the “.gz”)
--
Jérémie
Jasper Koehorst
2017-10-27 19:02:55 UTC
Permalink
I will look into the size of the jnl file, but shouldn't that be located where Blazegraph is running for the SPARQL endpoint, or is this a special flavour?
I was also thinking of looking into a GitLab runner which could occasionally generate an HDT file from the ttl dump, if our server can handle it. For this, would an md5sum file be preferable, or should a timestamp be sufficient?

Jasper
Post by Jérémie Roquet
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
the link was missing the “.gz”)
--
Jérémie
Edgard Marx
2017-10-27 19:56:10 UTC
Permalink
Hey guys,

I don't know if you already knew about it,
but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...

https://github.com/AKSW/KBox

And yes, you can also use it to merge your graph with one of those....

https://github.com/AKSW/KBox#how-can-i-query-multi-bases

cheers,
<emarx>



On Oct 27, 2017 21:02, "Jasper Koehorst" <***@gmail.com> wrote:

I will look into the size of the jnl file but should that not be located
where the blazegraph is running from the sparql endpoint or is this a
special flavour?
Was also thinking of looking into a gitlab runner which occasionally could
generate a HDT file from the ttl dump if our server can handle it but for
this an md5 sum file would be preferable or should a timestamp be
sufficient?

Jasper
Post by Jérémie Roquet
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
the link was missing the “.gz”)
--
Jérémie
Ghislain ATEMEZING
2017-10-28 07:54:55 UTC
Permalink
Hello emarx,
Many thanks for sharing KBox. Very interesting project!
One question: how do you deal with different versions of the KB, like the
case here of the Wikidata dump? Do you fetch their repo every so often?
Also, to avoid your users having to re-create the models, you could pre-load
"models" from the LOV catalog.

Cheers,
Ghislain
Post by Edgard Marx
Hey guys,
I don't know if you already knew about it,
but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...
https://github.com/AKSW/KBox
And yes, you can also use it to merge your graph with one of those....
https://github.com/AKSW/KBox#how-can-i-query-multi-bases
cheers,
<emarx>
I will look into the size of the jnl file but should that not be located
where the blazegraph is running from the sparql endpoint or is this a
special flavour?
Was also thinking of looking into a gitlab runner which occasionally could
generate a HDT file from the ttl dump if our server can handle it but for
this an md5 sum file would be preferable or should a timestamp be
sufficient?
Jasper
Post by Jérémie Roquet
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
the link was missing the “.gz”)
--
Jérémie
--
"*Love all, trust a few, do wrong to none*" (W. Shakespeare)
Laura Morales
2017-10-28 08:01:27 UTC
Permalink
Also, for avoiding your users to re-create the models, you can pre-load "models" from LOV catalog.
The LOV RDF dump is broken instead. Or at least it still was the last time I checked. And I don't mean broken in the Wikidata sense, that is, with some wrong types; I mean broken in the sense that it doesn't validate at all (some triples are malformed).
Ghislain ATEMEZING
2017-10-28 08:19:44 UTC
Permalink
Hi Laura,
Thanks for reporting that. I remember one issue that I added here: https://github.com/pyvandenbussche/lov/issues/66

Please shout out or flag an issue on GitHub! That will help with the quality of the different datasets published out there 😊

Best,
Ghislain

Sent from Mail for Windows 10

From: Laura Morales
Sent: Saturday, 28 October 2017 10:12
To: ***@lists.wikimedia.org
Cc: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Wikidata HDT dump
Also, for avoiding your users to re-create the models, you can pre-load "models" from LOV catalog.
The LOV RDF dump is broken instead. Or at least it still was the last time I checked. And I don't broken in the sense of Wikidata, that is with some wrong types, I mean broken as it doesn't validate at all (some triples are broken).
Laura Morales
2017-10-28 09:24:22 UTC
Permalink
Post by Ghislain ATEMEZING
Thanks to report that. I remember one issue that I added here https://github.com/pyvandenbussche/lov/issues/66
Yup, still broken! I've tried just now.
Ghislain ATEMEZING
2017-10-28 09:42:13 UTC
Permalink
@Laura : you mean this list http://lov.okfn.org/lov.nq.gz ?
I can download it !!

Which one ? Please send me the URL and I can fix it !!

Best,
Ghislain

Sent from Mail for Windows 10

From: Laura Morales
Sent: Saturday, 28 October 2017 11:24
To: ***@lists.wikimedia.org
Cc: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Wikidata HDT dump
Post by Ghislain ATEMEZING
Thanks to report that. I remember one issue that I added here https://github.com/pyvandenbussche/lov/issues/66
Yup, still broken! I've tried just now.
Laura Morales
2017-10-28 11:19:17 UTC
Permalink
@Laura : you mean this list http://lov.okfn.org/lov.nq.gz ?
I can download it !!
Which one ? Please send me the URL and I can fix it !!
Yes, you can download it, but the nq file is broken. It doesn't validate because some URIs contain white spaces, and some triples have an empty subject (i.e. <>).
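To see what such breakage looks like, here is a minimal sketch that flags the two defects described above. The regexes are simplifications of real N-Quads validation, and the sample lines are made up for illustration, not taken from the actual LOV dump:

```python
import re

# Two defects described above: whitespace inside a URI reference, and an
# empty subject "<>". Real N-Quads validation is stricter; this is a sketch.
BAD_URI = re.compile(r'<[^<>]*\s[^<>]*>')
EMPTY_SUBJECT = re.compile(r'^\s*<>')

def find_bad_lines(lines):
    """Return (line_number, reason) pairs for lines that would not validate."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if EMPTY_SUBJECT.match(line):
            problems.append((n, 'empty subject'))
        elif BAD_URI.search(line):
            problems.append((n, 'whitespace in URI'))
    return problems

# Made-up sample lines (NOT taken from the actual LOV dump):
sample = [
    '<http://example.org/s> <http://example.org/p> "ok" <http://example.org/g> .',
    '<http://example.org/bad uri> <http://example.org/p> "x" <http://example.org/g> .',
    '<> <http://example.org/p> "no subject" <http://example.org/g> .',
]
print(find_bad_lines(sample))  # → [(2, 'whitespace in URI'), (3, 'empty subject')]
```

A proper validator such as riot would also catch escaping and datatype problems; this only demonstrates the two failure modes reported here.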
Edgard Marx
2017-10-28 09:41:10 UTC
Permalink
Hi Ghislain,

On Sat, Oct 28, 2017 at 9:54 AM, Ghislain ATEMEZING <
Post by Ghislain ATEMEZING
Hello emarx,
Many thanks for sharing KBox. Very interesting project!
thanks
Post by Ghislain ATEMEZING
One question, how do you deal with different versions of the KB, like the
case here of wikidata dump?
KBox works with so-called KNS (Knowledge Name Service) servers, so any
dataset publisher can have their own KNS.
Each dataset has its own KN (Knowledge Name) that is distributed over the
KNS (Knowledge Name Service).
E.g. the Wikidata dump is https://www.wikidata.org/20160801.
Post by Ghislain ATEMEZING
Do you fetch their repo every xx time?
No, the idea is that each organization will have its own KNS, so users can
add the KNSes that they want.
Currently all datasets available in the KBox KNS are served by the KBox team.
You can check all of them at kbox.tech, or using the command line
(https://github.com/AKSW/KBox#how-can-i-list-available-knowledge-bases).
Post by Ghislain ATEMEZING
Also, for avoiding your users to re-create the models, you can pre-load
"models" from LOV catalog.
We plan to share all LOD datasets in KBox; we are currently discussing
this with the W3C, and DBpedia might have its own KNS soon.
Regarding the LOV catalog, you can help by just asking them to publish their
catalog in KBox.

best,
<emarx/>
http://emarx.org
Post by Ghislain ATEMEZING
Cheers,
Ghislain
Post by Edgard Marx
Hey guys,
I don't know if you already knew about it,
but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...
https://github.com/AKSW/KBox
And yes, you can also use it to merge your graph with one of those....
https://github.com/AKSW/KBox#how-can-i-query-multi-bases
cheers,
<emarx>
I will look into the size of the jnl file but should that not be located
where the blazegraph is running from the sparql endpoint or is this a
special flavour?
Was also thinking of looking into a gitlab runner which occasionally
could generate a HDT file from the ttl dump if our server can handle it but
for this an md5 sum file would be preferable or should a timestamp be
sufficient?
Jasper
Post by Jérémie Roquet
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (ie.
the link was missing the “.gz”)
--
Jérémie
--
"*Love all, trust a few, do wrong to none*" (W. Shakespeare)
Laura Morales
2017-10-28 11:16:57 UTC
Permalink
No, the idea is that each organization will have its own KNS, so users can add the KNS that they want. 
How would this compare with a traditional SPARQL endpoint + "federated queries", or with "linked fragments"?
Edgard Marx
2017-10-28 11:41:30 UTC
Permalink
Hi Laura,

Thanks for the opportunity to clarify this.
KBox is an alternative to other existing architectures for publishing KBs,
such as SPARQL endpoints (e.g. LDFragments, Virtuoso) and dump files.
I should add that you can do federated queries with KBox just as easily as
you can with SPARQL endpoints.
Here is an example:

https://github.com/AKSW/KBox#how-can-i-query-multi-bases

You can use KBox either via the Java API or from the command prompt.

best,
<emarx/>
http://emarx.org
Post by Edgard Marx
No, the idea is that each organization will have its own KNS, so users
can add the KNS that they want.
How would this compare with a traditional SPARQL endpoint + "federated
queries", or with "linked fragments"?
Laura Morales
2017-10-28 12:31:33 UTC
Permalink
KBox is an alternative to other existing architectures for publishing KB such as SPARQL endpoints (e.g. LDFragments, Virtuoso), and Dump files.
I should add that you can do federated query with KBox as as easier as you can do with SPARQL endpoints.
OK, but I still fail to see the value of this. Why would I want to use it rather than just start a Fuseki endpoint, or use linked fragments?
Edgard Marx
2017-10-28 12:56:09 UTC
Permalink
Post by Edgard Marx
Post by Edgard Marx
KBox is an alternative to other existing architectures for publishing KB
such as SPARQL endpoints (e.g. LDFragments, Virtuoso), and Dump files.
Post by Edgard Marx
I should add that you can do federated query with KBox as as easier as
you can do with SPARQL endpoints.
OK, but I still fail to see what is the value of this? What's the reason
why I'd want to use it rather than just start a Fuseki endpoint, or use
linked-fragments?
I agree that KBox is not suited to all scenarios; rather, it fits users
that frequently query a KG and do not want to spend time downloading and
indexing dump files.
KBox bridges this cumbersome task; plus, it shifts query execution to the
client, so there are no scalability issues.
BTW, if you want to work with JavaScript you can also simply start a local
endpoint:

https://github.com/AKSW/KBox/blob/master/README.md#starting-a-sparql-endpoint
Stas Malyshev
2017-10-28 06:41:33 UTC
Permalink
Hi!
Post by Jasper Koehorst
I will look into the size of the jnl file but should that not be
located where the blazegraph is running from the sparql endpoint or
is this a special flavour? Was also thinking of looking into a gitlab
runner which occasionally could generate a HDT file from the ttl dump
if our server can handle it but for this an md5 sum file would be
preferable or should a timestamp be sufficient?
Publishing the jnl file for Blazegraph may not be as useful as one would
think, because a jnl file is specific to a particular vocabulary and
certain other settings; i.e., unless you run the same WDQS code (which
customizes some of these) of the same version, you won't be able to use
the same file. Of course, since the WDQS code is open source, that may be
good enough, so in general publishing such a file may be possible.

Currently, it's about 300GB uncompressed. No idea how much that is
compressed. Loading it takes a couple of days on a reasonably powerful
machine, more on labs ones (I haven't tried to load the full dump on labs
for a while, since labs VMs are too weak for that).

In general, I'd say it takes about 100MB per million triples. Less if
the triples use repeated URIs, probably more if they contain a ton of
text data.
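Turning Stas's rule of thumb around as a sanity check (the per-million figure and the ~300GB journal size are from his message above; the resulting triple count is derived arithmetic, not an official number):

```python
# Stas's rule of thumb: ~100 MB of Blazegraph journal per million triples.
MB_PER_MILLION_TRIPLES = 100

def estimated_triples(journal_gb):
    """Invert the rule of thumb: journal size in GB -> rough triple count."""
    journal_mb = journal_gb * 1000
    return journal_mb / MB_PER_MILLION_TRIPLES * 1_000_000

# The ~300 GB uncompressed journal mentioned above corresponds to roughly
# 3 billion triples under this estimate.
print(int(estimated_triples(300)))  # → 3000000000
```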
--
Stas Malyshev
***@wikimedia.org
Ghislain ATEMEZING
2017-10-28 08:15:39 UTC
Permalink
Hi,
+1 to not sharing the jnl file!
I agree with Stas that it doesn't seem best practice to publish a journal file specific to a given RDF store (here Blazegraph).
Regarding the size of that jnl file, I remember one project with almost 500M for 1 billion triples (~1/2 the disk size of the dataset).

Best,
Ghislain


Sent from Mail for Windows 10

Jérémie Roquet
2017-10-27 19:01:37 UTC
Permalink
Post by Jérémie Roquet
Post by Luigi Assom
http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
Javier D. Fernández of the HDT team was very quick to fix the link :-)

One can contact them for support either on their forum or by email¹,
as they are willing to help the Wikidata community make use of HDT.

Best regards,

¹ http://www.rdfhdt.org/team/
--
Jérémie
Laura Morales
2017-10-27 21:14:20 UTC
Permalink
Post by Jérémie Roquet
Javier D. Fernández of the HDT team was very quick to fix the link :-)
Their dump is almost a year old, though.
Laura Morales
2017-10-27 21:04:58 UTC
Permalink
is it possible to store a weighted adjacency matrix as an HDT instead of an RDF?
Something like a list of entities for each entity, or even better a list of tuples for each entity.
So that a tuple could be generalised with propoerties.
Sorry, I don't know this; you would have to ask the devs. As far as I understand, it's a triplestore and that's it...
Wouter Beek
2017-10-27 15:20:45 UTC
Permalink
Dear Laura, others,

If somebody points me to the RDF data dump of Wikidata I can deliver an
HDT version of it, no problem. (Given the current cost of memory I
do not believe that the memory consumption for HDT creation is a
blocker.)

---
Cheers,
Wouter Beek.

Email: ***@triply.cc
WWW: http://triply.cc
Tel: +31647674624
Post by Laura Morales
Hello everyone,
I'd like to ask if Wikidata could please offer a HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format to store RDF data, which is pretty useful because it can be queried from command line, it can be used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space to store the same data. The problem is that it's very impractical to generate a HDT, because the current implementation requires a lot of RAM processing to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you guys have one to share, I can help setup the rdf2hdt software required to convert Wikidata Turtle to HDT.
Thank you.
[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/
Ghislain ATEMEZING
2017-10-27 15:29:12 UTC
Permalink
@Wouter: See here https://dumps.wikimedia.org/wikidatawiki/entities/ ?
Nice idea, Laura.

Ghislain
Post by Wouter Beek
Dear Laura, others,
If somebody points me to the RDF datadump of Wikidata I can deliver an
HDT version for it, no problem. (Given the current cost of memory I
do not believe that the memory consumption for HDT creation is a
blocker.)
---
Cheers,
Wouter Beek.
WWW: http://triply.cc
Tel: +31647674624 <+31%206%2047674624>
Post by Laura Morales
Hello everyone,
I'd like to ask if Wikidata could please offer a HDT [1] dump along with
the already available Turtle dump [2]. HDT is a binary format to store RDF
data, which is pretty useful because it can be queried from command line,
it can be used as a Jena/Fuseki source, and it also uses
orders-of-magnitude less space to store the same data. The problem is that
it's very impractical to generate a HDT, because the current implementation
requires a lot of RAM processing to convert a file. For Wikidata it will
probably require a machine with 100-200GB of RAM. This is unfeasible for me
because I don't have such a machine, but if you guys have one to share, I
can help setup the rdf2hdt software required to convert Wikidata Turtle to
HDT.
Post by Laura Morales
Thank you.
[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/
--
-------
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
Wouter Beek
2017-10-27 22:34:27 UTC
Permalink
Hi Ghislain,
Post by Ghislain ATEMEZING
@Wouter: See here https://dumps.wikimedia.org/wikidatawiki/entities/ ?
Thanks for the pointer! I'm downloading from
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.

The Content-Type header for that URI seems incorrect to me: it says
`application/octet-stream`, but the file actually contains `text/turtle`.
(For specifying the compression mechanism, the Content-Encoding header
should be used.)
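To make the header point concrete, a small sketch; the function name and the header dictionaries here are illustrative reconstructions, not a capture of the actual server response:

```python
def check_dump_headers(headers):
    """Flag the two header issues described above for a gzipped Turtle file."""
    issues = []
    if headers.get("Content-Type") != "text/turtle":
        # The media type should describe the decoded payload, not the gzip wrapper.
        issues.append("Content-Type should be text/turtle for the decoded payload")
    if headers.get("Content-Encoding") != "gzip":
        # The compression layer belongs in Content-Encoding.
        issues.append("the compression belongs in Content-Encoding: gzip")
    return issues

# Headers as reportedly served (illustrative, not a live capture):
print(check_dump_headers({"Content-Type": "application/octet-stream"}))
```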

The first part of the Turtle data stream seems to contain syntax errors for
some of the XSD decimal literals. The first one appears on line 13,291:

```text/turtle
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> <http://wikiba.se/ontology-beta#geoPrecision> "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
```

Notice that scientific notation is not allowed in the lexical form of
decimals according to XML Schema Part 2: Datatypes
<https://www.w3.org/TR/xmlschema11-2/#decimal>. (It is allowed in floats
and doubles.) Is this a known issue or should I report this somewhere?
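The distinction Wouter cites can be sketched with the lexical patterns involved. These regexes are simplifications of the XML Schema Part 2 grammars (the INF/-INF/NaN forms of double are omitted):

```python
import re

# Simplified lexical pattern for xsd:decimal per XML Schema Part 2: an
# optional sign, digits, an optional decimal point -- but no exponent.
XSD_DECIMAL = re.compile(r'^[+-]?(\d+(\.\d*)?|\.\d+)$')

# xsd:double (and xsd:float) additionally allow an exponent.
XSD_DOUBLE = re.compile(r'^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$')

value = "1.0E-6"  # the geoPrecision literal from the dump
print(bool(XSD_DECIMAL.match(value)))  # → False: invalid as xsd:decimal
print(bool(XSD_DOUBLE.match(value)))   # → True: valid as xsd:double
```

So the literal itself is fine; it is only the datatype IRI attached to it that makes it invalid.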

---
Cheers!,
Wouter.

Email: ***@triply.cc
WWW: http://triply.cc
Tel: +31647674624
Laura Morales
2017-10-28 06:04:34 UTC
Permalink
Notice that scientific notation is not allowed in the lexical form of decimals according to XML Schema Part 2: Datatypes [https://www.w3.org/TR/xmlschema11-2/#decimal]. (It is allowed in floats and doubles.) Is this a known issue or should I report this somewhere?
I wouldn't call these "syntax" errors, just "logical/type" errors.
It would be great if these could be fixed by changing the type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without these kinds of errors, so I would personally treat them as warnings at worst.

@Wouter, when you build the HDT file, could you please also generate the .hdt.index file? With rdf2hdt, this should be activated with the -i flag. Thank you again!
Stas Malyshev
2017-10-28 06:50:35 UTC
Permalink
Hi!
Post by Laura Morales
I wouldn't call these "syntax" errors, just "logical/type" errors.
It would be great if these could fixed by changing the correct type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of errors. So I would personally treat these as warnings at worst.
Float/double are range-limited and have limited precision. Decimals are
not. Whether that matters for geo precision needs to be checked, but
we could be hitting the limits of precision pretty quickly.
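Stas's point about precision is easy to demonstrate in any IEEE-754 implementation; a quick illustration using Python's 64-bit float versus arbitrary-precision decimals (Python's decimal module stands in here for the unbounded xsd:decimal value space):

```python
from decimal import Decimal

# A 64-bit double carries only ~15-17 significant decimal digits, so a
# difference beyond that precision silently disappears:
print(float("1.000000000000000000001") == 1.0)  # → True: the tail is lost

# Arbitrary-precision decimals (standing in for xsd:decimal) keep it:
print(Decimal("1.000000000000000000001") == Decimal("1"))  # → False
```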
--
Stas Malyshev
***@wikimedia.org
Ghislain ATEMEZING
2017-10-28 07:34:10 UTC
Permalink
Hi,
@Wouter: As Stas said, you might report that error. I don't agree with Laura, who tried to underestimate that "syntax error". It's also about quality ;)

Many thanks in advance !
@Laura: Do you have a different rdf2hdt program, or the one in the GitHub repository of the HDT project?

Best,
Ghislain

Sent from my iPhone, may include typos
Post by Stas Malyshev
Hi!
Post by Laura Morales
I wouldn't call these "syntax" errors, just "logical/type" errors.
It would be great if these could fixed by changing the correct type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of errors. So I would personally treat these as warnings at worst.
Float/double are range-limited and have limited precision. Decimals are
not. Whether it is important for geo precision, needs to be checked, but
we could be hitting the limits of precision pretty quickly.
--
Stas Malyshev
Laura Morales
2017-10-28 07:57:31 UTC
Permalink
Post by Ghislain ATEMEZING
@Wouter: As Stas said, you might report that error. I don't agree with Laura who tried to under estimate that "syntax error". It's also about quality ;)
Don't get me wrong, I am all in favor of data quality! :) So if this can be fixed, all the better! The thing is, I've seen so many datasets with these kinds of type errors that by now I pretty much live with them and I'm OK with treating them as warnings (the triple is not broken after all; it's just not following the standard).
Post by Ghislain ATEMEZING
@Laura: Do you have a different rdf2hdt program or the one in the GitHub of HDT project ?
I just use https://github.com/rdfhdt/hdt-cpp compiled from the master branch. To verify data, I use riot (a command-line tool from the Apache Jena package) like this: `riot --validate file.nt`.
Stas Malyshev
2017-10-28 06:48:00 UTC
Permalink
Hi!
Post by Wouter Beek
The first part of the Turtle data stream seems to contain syntax errors
```text/turtle
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35>
<http://wikiba.se/ontology-beta#geoPrecision>
"1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
```
Could you submit a Phabricator task (phabricator.wikimedia.org) about
this? If it's against the standard, it certainly should not be encoded
like that.
--
Stas Malyshev
***@wikimedia.org
Stas Malyshev
2017-10-28 11:26:28 UTC
Permalink
Hi!
Post by Wouter Beek
The first part of the Turtle data stream seems to contain syntax errors
```text/turtle
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35>
<http://wikiba.se/ontology-beta#geoPrecision>
"1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
I've added https://phabricator.wikimedia.org/T179228 to handle this.
geoPrecision is a float value, and assigning the decimal type to it is a
mistake. I'll review the other properties to see whether we have more of
these. Thanks for bringing it to my attention!
--
Stas Malyshev
***@wikimedia.org
Laura Morales
2017-10-31 06:52:52 UTC
Permalink
@Wouter
Thanks for the pointer!  I'm downloading from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
Any luck so far?
Ghislain ATEMEZING
2017-10-31 11:09:03 UTC
Permalink
@Laura: I suspect Wouter wants to know whether he should "ignore" the
previous errors and produce a somewhat incomplete dump (just for you),
or wait for Stas' feedback.
Btw, why don't you use the older version on the HDT website?
Post by Ghislain ATEMEZING
@Wouter
Post by Wouter Beek
Thanks for the pointer! I'm downloading from
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
Any luck so far?
--
-------
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
Laura Morales
2017-10-31 13:56:01 UTC
Permalink
@Laura: I suspect Wouter wants to know if he "ignores" the previous errors and proposes a rather incomplete dump (just for you) or waits for Stas' feedback.
OK. I wonder, though, if it would be possible to set up a regular HDT dump alongside the already regular dumps. Looking at the dumps page, https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is generated once a week, more or less. So if an HDT dump could be added to the schedule, it would show up with the next dump and then so forth with the future dumps. Right now even the Turtle dump contains the bad triples, so adding an HDT file now would not introduce more inconsistencies. The problem will be fixed automatically in future dumps once the Turtle is fixed (because the HDT is generated from the .ttl file anyway).
Btw why don't you use the oldest version in HDT website?
1. I have downloaded it and I'm trying to use it, but the HDT tools (eg. query) require to build an index before I can use the HDT file. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file unless there is another way to generate the index on commodity hardware

2. because it's 1 year old :)
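For anyone wanting to reproduce the conversion and index step being discussed, a minimal sketch with the hdt-cpp command-line tools could look roughly like this. Tool flags are recalled from rdf2hdt/hdtSearch and may differ between versions, and the file names are illustrative; the block skips itself if the tools are not installed:

```shell
# Sketch of the Turtle -> HDT conversion and index generation with hdt-cpp.
# Assumption: rdf2hdt and hdtSearch are on PATH; flags may vary by version.
if command -v rdf2hdt >/dev/null 2>&1; then
  # Conversion keeps the dictionary in RAM, hence the huge memory requirement.
  rdf2hdt -f turtle latest-all.ttl wikidata.hdt
  # Opening the file for a query also writes wikidata.hdt.index alongside it.
  hdtSearch -q "? ? ?" wikidata.hdt
  status=converted
else
  status=skipped   # hdt-cpp tools not installed; this is only a sketch
fi
echo "$status"
```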
Ghislain ATEMEZING
2017-10-31 14:13:52 UTC
Permalink
Interesting use case, Laura! Your UC is rather "special" :)
Let me try to understand ...
You are a "data consumer" with the following needs:
- Latest version of the data
- Quick access to the data
- You don't want to use the current access channels offered by the
publisher (endpoint, ttl dumps, LDFragments)
However, you ask for a binary format (HDT), but you don't have enough
memory to set up your own environment/endpoint.
For that reason, you are asking the publisher to provide both .hdt and
.hdt.index files.

Do you think there are many users with your current UC?
Post by Ghislain ATEMEZING
@Laura: I suspect Wouter wants to know if he "ignores" the previous
errors and proposes a rather incomplete dump (just for you) or waits for
Stas' feedback.
OK. I wonder though, if it would be possible to setup a regular HDT dump
alongside the already regular dumps. Looking at the dumps page,
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new
dump is generated once a week more or less. So if a HDT dump could be added
to the schedule, it should show up with the next dump and then so forth
with the future dumps. Right now even the Turtle dump contains the bad
triples, so adding a HDT file now would not introduce more inconsistencies.
The problem will be fixed automatically with the future dumps once the
Turtle is fixed (because the HDT is generated from the .ttl file anyway).
Post by Ghislain ATEMEZING
Btw why don't you use the oldest version in HDT website?
1. I have downloaded it and I'm trying to use it, but the HDT tools (eg.
query) require to build an index before I can use the HDT file. I've tried
to create the index, but I ran out of memory again (even though the index
is smaller than the .hdt file itself). So any Wikidata dump should contain
both the .hdt file and the .hdt.index file unless there is another way to
generate the index on commodity hardware
2. because it's 1 year old :)
--
-------
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org
Laura Morales
2017-10-31 15:03:44 UTC
Permalink
I feel like you are misrepresenting my request, and possibly trying to offend me as well.

My "UC" as you call it, is simply that I would like to have a local copy of wikidata, and query it using SPARQL. Everything that I've tried so far doesn't seem to work on commodity hardware since the database is so large. But HDT could work. So I asked if a HDT dump could, please, be added to other dumps that are periodically generated by wikidata. I also told you already that *I AM* trying to use the 1 year old dump, but in order to use the HDT tools I'm told that I *MUST* generate some other index first which unfortunately I can't generate for the same reasons that I can convert the Turtle to HDT. So what I was trying to say is, that if wikidata were to add any HDT dump, this dump should contain both the .hdt file and .hdt.index in order to be useful. That's about it, and it's not just about me. Anybody who wants to have a local copy of wikidata could benefit from this, since setting up a .hdt file seems much easier than a Turtle dump. And I don't understand why you're trying to blame me for this?

If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" or "don't care" response rather than playing the passive-aggressive game that you displayed in your last email.
Post by Ghislain ATEMEZING
Let me try to understand ...
- Latest version of the data
- Quick access to the data
- You don't want to use the current ways to access the data by the publisher (endpoint, ttl dumps, LDFragments)
However, you ask for a binary format (HDT), but you don't have enough memory to set up your own environment/endpoint due to lack of memory.
For that reason, you are asking the publisher to support both .hdt and .hdt.index files.
Do you think there are many users with your current UC?
Luigi Assom
2017-10-31 19:44:13 UTC
Permalink
Doh, what's wrong with asking for support for one's own use case ("UC")?

I think it is a totally legitimate question to ask, and that's why this thread
exists.

Also, I support making data accessible when it would otherwise be
hard to process on "common" hardware, especially in the case of open data.
Open data exists so that anyone can take it and build on it - amazing if you can
prototype locally, right?

I don't like the use case where a data-scientist-or-IT shows their work
to other data-scientists-or-IT looking for emotional support or praise.
I've seen that, not here, and I hope that attitude indeed stays out of
here.

I do like it when the work of a data-scientist-or-IT ignites someone else's
creativity - someone who is completely external - so they can say: hey, your work
is cool and I want to use it for... my use case!
That's how ideas spread and help other people build complexity on top of
them, without erecting unnecessary barriers.

As for a local version of compressed, indexed RDF - I think that if it were
available, more people probably would use it.
Post by Laura Morales
I feel like you are misrepresenting my request, and possibly trying to offend me as well.
My "UC" as you call it, is simply that I would like to have a local copy
of wikidata, and query it using SPARQL. Everything that I've tried so far
doesn't seem to work on commodity hardware since the database is so large.
But HDT could work. So I asked if a HDT dump could, please, be added to
other dumps that are periodically generated by wikidata. I also told you
already that *I AM* trying to use the 1 year old dump, but in order to use
the HDT tools I'm told that I *MUST* generate some other index first which
unfortunately I can't generate for the same reasons that I can convert the
Turtle to HDT. So what I was trying to say is, that if wikidata were to add
any HDT dump, this dump should contain both the .hdt file and .hdt.index in
order to be useful. That's about it, and it's not just about me. Anybody
who wants to have a local copy of wikidata could benefit from this, since
setting up a .hdt file seems much easier than a Turtle dump. And I don't
understand why you're trying to blame me for this?
If you are part of the wikidata dev team, I'd greatly appreciate a
"can/can't" or "don't care" response rather than playing the
passive-aggressive game that you displayed in your last email.
Post by Ghislain ATEMEZING
Let me try to understand ...
- Latest version of the data
- Quick access to the data
- You don't want to use the current ways to access the data by the
publisher (endpoint, ttl dumps, LDFragments)
Post by Ghislain ATEMEZING
However, you ask for a binary format (HDT), but you don't have enough
memory to set up your own environment/endpoint due to lack of memory.
Post by Ghislain ATEMEZING
For that reason, you are asking the publisher to support both .hdt and
.hdt.index files.
Post by Ghislain ATEMEZING
Do you think there are many users with your current UC?
Ghislain ATEMEZING
2017-10-31 20:05:39 UTC
Permalink
Hola,
Please don't get me wrong, and don't read any interpretation into my question.
Since the beginning of this thread, I have also been trying to push the use of HDT here. For example, I was the one contacting the HDT gurus on Twitter to get the dataset error fixed, and so on...

Sorry if Laura or anyone thought I was giving "some lessons" here. I don't have a supercomputer either, nor am I a member of the Wikidata team. Just a "data consumer" like many here...

Best,
Ghislain

Sent from my iPhone, may include typos
Post by Luigi Assom
Doh what's wrong with asking for supporting own user case "UC" ?
I think it is a totally legit question to ask, and that's why this thread exists.
Also, I do support for possibility to help access to data that would be hard to process from "common" hardware. Especially in the case of open data.
They exists to allow someone take them and build them - amazing if can prototype locally, right?
I don't like the use case where a data-scientist-or-IT show to the other data-scientist-or-IT own work looking for emotional support or praise.
I've seen that, not here, and I hope this attitude stays indeed out from here..
I do like when the work of data-scientist-or-IT ignites someone else's creativity - someone who is completely external - , to say: hey your work is cool and I wanna use it for... my use case!
That's how ideas go around and help other people build complexity over them, without constructing not necessary borders.
About a local version of compressed, index RDF - I think that if was available, more people yes probably would use it.
Post by Laura Morales
I feel like you are misrepresenting my request, and possibly trying to offend me as well.
My "UC" as you call it, is simply that I would like to have a local copy of wikidata, and query it using SPARQL. Everything that I've tried so far doesn't seem to work on commodity hardware since the database is so large. But HDT could work. So I asked if a HDT dump could, please, be added to other dumps that are periodically generated by wikidata. I also told you already that *I AM* trying to use the 1 year old dump, but in order to use the HDT tools I'm told that I *MUST* generate some other index first which unfortunately I can't generate for the same reasons that I can convert the Turtle to HDT. So what I was trying to say is, that if wikidata were to add any HDT dump, this dump should contain both the .hdt file and .hdt.index in order to be useful. That's about it, and it's not just about me. Anybody who wants to have a local copy of wikidata could benefit from this, since setting up a .hdt file seems much easier than a Turtle dump. And I don't understand why you're trying to blame me for this?
If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" or "don't care" response rather than playing the passive-aggressive game that you displayed in your last email.
Post by Ghislain ATEMEZING
Let me try to understand ...
- Latest version of the data
- Quick access to the data
- You don't want to use the current ways to access the data by the publisher (endpoint, ttl dumps, LDFragments)
However, you ask for a binary format (HDT), but you don't have enough memory to set up your own environment/endpoint due to lack of memory.
For that reason, you are asking the publisher to support both .hdt and .hdt.index files.
Do you think there are many users with your current UC?
Jérémie Roquet
2017-10-31 18:03:31 UTC
Permalink
Post by Laura Morales
1. I have downloaded it and I'm trying to use it, but the HDT tools (eg. query) require to build an index before I can use the HDT file. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file unless there is another way to generate the index on commodity hardware
I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index, but ten times that is more than enough), so
here are a few interesting metrics:
- the index alone is ~14 GiB big uncompressed, ~9 GiB gzipped and
~6.5 GiB xzipped ;
- once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory ;
- right after index generation, it includes ~16 GiB of anonymous
memory (with no memory pressure, that's ~26 GiB resident)…
- …but after a reload, the index is memory mapped as well, so it only
includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident).

Looks like a good candidate for commodity hardware, indeed. It loads
in less than one second on a 32 GiB machine. I'll try to run a few
queries to see how it behaves.
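The virtual vs resident distinction in these numbers can be inspected with standard tools; a trivial check (assuming a procps-style ps):

```shell
# Print virtual (VSZ) and resident (RSS) size of the current shell, in KiB.
# Memory-mapped files count toward VSZ, but only touched pages toward RSS,
# which is why a freshly mapped HDT index barely moves resident memory.
ps -o vsz=,rss= -p $$
```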

FWIW, my use case is very similar to yours, as I'd like to run queries
that are too long for the public SPARQL endpoint and can't dedicate a
powerful machine to this full time (Blazegraph runs fine with 32 GiB,
though — it just takes a while to index, and updating is not as fast as
the changes happening on wikidata.org).
--
Jérémie
Laura Morales
2017-10-31 20:27:14 UTC
Permalink
Post by Jérémie Roquet
I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index but ten times this is more than enough)


Could you please share a bit about your setup? Do you have a machine with 320GB of RAM?
Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too.
Thank you!
Post by Jérémie Roquet
I'll try to run a few queries to see how it behaves.
I don't think there is a command-line tool to parse SPARQL queries, so you probably have to set up a Fuseki endpoint which uses HDT as a data source.
Jérémie Roquet
2017-10-31 20:58:40 UTC
Permalink
Post by Jérémie Roquet
Post by Jérémie Roquet
I've just loaded the provided hdt file on a big machine (32 GiB wasn't
enough to build the index but ten times this is more than enough)
Could you please share a bit about your setup? Do you have a machine with 320GB of RAM?
It's a machine with 378 GiB of RAM and 64 threads running Scientific
Linux 7.2, which we use mainly for benchmarks.

Building the index was really all about memory because the CPUs have
actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
to those of my regular workstation, which was unable to build it.
Post by Jérémie Roquet
Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too.
As I'm also looking for up-to-date results, I plan to do it with the
latest Turtle dump as soon as I have a time slot for it; I'll let you
know about the outcome.
Post by Jérémie Roquet
Post by Jérémie Roquet
I'll try to run a few queries to see how it behaves.
I don't think there is a command-line tool to parse SPARQL queries, so you probably have to setup a Fuseki endpoint which uses HDT as a data source.
You're right. The limited query language of hdtSearch is closer to
grep than to SPARQL.

Thank you for pointing out Fuseki, I'll have a look at it.
--
Jérémie
Laura Morales
2017-11-01 06:47:09 UTC
Permalink
Post by Jérémie Roquet
It's a machine with 378 GiB of RAM and 64 threads running Scientific
Linux 7.2, that we use mainly for benchmarks.
Building the index was really all about memory because the CPUs have
actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
to those of my regular workstation, which was unable to build it.
If your regular workstation was using more CPU, I guess it was because of swapping. Thanks for the statistics; it means a "commodity" CPU could handle this fine, and the bottleneck is RAM. I wonder how expensive it is to buy a machine like yours... it sounds like it's in the $30K-$50K range?
Post by Jérémie Roquet
You're right. The limited query language of hdtSearch is closer to
grep than to SPARQL.
Thank you for pointing out Fuseki, I'll have a look at it.
I think a SPARQL command-line tool could exist, but AFAICT it doesn't (yet?). Anyway, I have already successfully set up Fuseki with a HDT backend, although my HDT files are all small. Feel free to drop me an email if you need any help setting up Fuseki.
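For readers who want to reproduce this, a Fuseki assembler sketch for an HDT-backed dataset follows. The hdt: namespace and property names are recalled from hdt-java's Fuseki example and should be double-checked against that repository; the service name and file path are illustrative:

```turtle
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix hdt:    <http://www.rdfhdt.org/fuseki#> .

<#service> rdf:type fuseki:Service ;
    fuseki:name          "wikidata" ;   # endpoint served at /wikidata
    fuseki:serviceQuery  "sparql" ;     # SPARQL query service
    fuseki:dataset       <#dataset> .

# The HDT dataset stays memory-mapped on disk (.hdt plus .hdt.index).
<#dataset> rdf:type hdt:HDTDataset ;
    hdt:fileName "/data/wikidata.hdt" .
```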
Jasper Koehorst
2017-11-01 06:52:57 UTC
Permalink
We are actually planning to buy a new barebone server, and they are around €2,500 with barely any memory. I will check later with sales; 16 GiB RAM sticks are at most around €200 each, so below €10K should be sufficient?
Post by Laura Morales
Post by Jérémie Roquet
It's a machine with 378 GiB of RAM and 64 threads running Scientific
Linux 7.2, that we use mainly for benchmarks.
Building the index was really all about memory because the CPUs have
actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared
to those of my regular workstation, which was unable to build it.
If your regular workstation was using more CPU, I guess it was because of swapping. Thanks for the statistics, it means a "commodity" CPU could handle this fine, the bottleneck is RAM. I wonder how expensive it is to buy a machine like yours... it sounds like in the $30K-$50K range?
Post by Jérémie Roquet
You're right. The limited query language of hdtSearch is closer to
grep than to SPARQL.
Thank you for pointing out Fuseki, I'll have a look at it.
I think a SPARQL command-line tool could exist, but AFAICT it doesn't exist (yet?). Anyway, I have already successfully setup Fuseki with a HDT backend, although my HDT files are all small. Feel free to drop me an email if you need any help setting up Fuseki.
Stas Malyshev
2017-10-31 23:32:28 UTC
Permalink
Hi!
Post by Laura Morales
OK. I wonder though, if it would be possible to setup a regular HDT
dump alongside the already regular dumps. Looking at the dumps page,
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
new dump is generated once a week more or less. So if a HDT dump
could
True, the dumps run weekly. A "more or less" situation can arise only if
one of the dumps fails (either due to a bug or some sort of external
force majeure).
--
Stas Malyshev
***@wikimedia.org
sushil dutt
2017-11-01 05:05:41 UTC
Permalink
Please take me out from these conversations.
Post by Stas Malyshev
Hi!
Post by Laura Morales
OK. I wonder though, if it would be possible to setup a regular HDT
dump alongside the already regular dumps. Looking at the dumps page,
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
new dump is generated once a week more or less. So if a HDT dump
could
True, the dumps run weekly. "More or less" situation can arise only if
one of the dumps fail (either due to a bug or some sort of external
force majeure).
--
Stas Malyshev
--
Regards,
Sushil Dutt
8800911840
Laura Morales
2017-11-01 07:08:03 UTC
Permalink
Post by sushil dutt
Please take me out from these conversations.
Sorry for the long thread; this is probably a small inconvenience with mailing lists. However, the "Subject" is always the same, so you can delete the messages right away without having to read them.
Jasper Koehorst
2017-11-01 06:49:19 UTC
Permalink
Hello,

I am currently downloading the latest ttl file on a 250 GiB RAM machine. I will see if that is sufficient to run the conversion; otherwise we have another busy one with around 310 GiB.
For querying I use the Jena query engine. I have created a module called HDTQuery, located at http://download.systemsbiology.nl/sapp/, which is a simple program under development that should be able to use the full power of SPARQL and be more advanced than grep… ;)

If this all works out I will see with our department whether we can set up, if it is still needed, a weekly cron job to convert the TTL file. But as it is growing rapidly we might run into memory issues later?
Post by Stas Malyshev
Hi!
Post by Laura Morales
OK. I wonder though, if it would be possible to setup a regular HDT
dump alongside the already regular dumps. Looking at the dumps page,
https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a
new dump is generated once a week more or less. So if a HDT dump
could
True, the dumps run weekly. "More or less" situation can arise only if
one of the dumps fail (either due to a bug or some sort of external
force majeure).
--
Stas Malyshev
Laura Morales
2017-11-01 06:59:37 UTC
Permalink
I am currently downloading the latest ttl file. On a 250gig ram machine. I will see if that is sufficient to run the conversion Otherwise we have another busy one with  around 310 gig.
Thank you!
For querying I use the Jena query engine. I have created a module called HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple program and under development that should be able to use the full power of SPARQL and be more advanced than grep… ;)
Does this tool allow to query HDT files from command-line, with SPARQL, and without the need to setup a Fuseki endpoint?
If this all works out I will see with our department if we can set up if it is still needed a weekly cron job to convert the TTL file. But as it is growing rapidly we might run into memory issues later?
Thank you!
Jasper Koehorst
2017-11-01 07:00:52 UTC
Permalink
Yes, you just run it and should get sufficient help; if not… I am more than happy to polish the code…

java -jar /Users/jasperkoehorst/Downloads/HDTQuery.jar
The following option is required: -query
Usage: <main class> [options]
Options:
  --help
  -debug
      Debug mode
      Default: false
  -e
      SPARQL endpoint
  -f
      Output format, csv / tsv
      Default: csv
  -i
      HDT input file(s) for querying (comma separated)
  -o
      Query result file
* -query
      SPARQL Query or FILE containing the query to execute

* required parameter
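A hypothetical invocation matching the help text above (the jar location, HDT file, and query file names are illustrative; the block only prints the command line if the jar is not present):

```shell
# Hypothetical HDTQuery invocation assembled from the --help text above.
CMD='java -jar HDTQuery.jar -i wikidata.hdt -query query.rq -f csv -o results.csv'
if [ -f HDTQuery.jar ]; then
  $CMD
else
  echo "$CMD"   # jar not present here; showing the command line only
fi
```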
Post by Laura Morales
Post by Jasper Koehorst
I am currently downloading the latest ttl file. On a 250gig ram machine. I will see if that is sufficient to run the conversion Otherwise we have another busy one with around 310 gig.
Thank you!
Post by Jasper Koehorst
For querying I use the Jena query engine. I have created a module called HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple program and under development that should be able to use the full power of SPARQL and be more advanced than grep… ;)
Does this tool allow to query HDT files from command-line, with SPARQL, and without the need to setup a Fuseki endpoint?
Post by Jasper Koehorst
If this all works out I will see with our department if we can set up if it is still needed a weekly cron job to convert the TTL file. But as it is growing rapidly we might run into memory issues later?
Thank you!
Osma Suominen
2017-11-02 13:22:18 UTC
Permalink
Post by Laura Morales
For querying I use the Jena query engine. I have created a module called HDTQuery located http://download.systemsbiology.nl/sapp/ which is a simple program and under development that should be able to use the full power of SPARQL and be more advanced than grep… ;)
Does this tool allow to query HDT files from command-line, with SPARQL, and without the need to setup a Fuseki endpoint?
There is also a command line tool called hdtsparql in the hdt-java
distribution that allows exactly this. It used to support only SELECT
queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK
queries too. There are some limitations, for example only CSV output is
supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE. But it
works fine, at least for my use cases, and is often more convenient than
firing up Fuseki-HDT. It requires both the hdt file and the
corresponding index file.

Code here:
https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java

The tool is in the hdt-jena package (not hdt-java-cli where the other
command line tools reside), since it uses parts of Jena (e.g. ARQ).
There is a wrapper script called hdtsparql.sh for executing it with the
proper Java environment.

Typical usage (example from hdt-java README):

# Execute SPARQL Query against the file.
$ ./hdtsparql.sh ../hdt-java/data/test.hdt "SELECT ?s ?p ?o WHERE { ?s
?p ?o . }"

-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi
Laura Morales
2017-11-02 13:54:04 UTC
Permalink
Post by Osma Suominen
There is also a command line tool called hdtsparql in the hdt-java
distribution that allows exactly this. It used to support only SELECT
queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK
queries too. There are some limitations, for example only CSV output is
supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE.

Thank you for sharing.
Post by Osma Suominen
The tool is in the hdt-jena package (not hdt-java-cli where the other
command line tools reside), since it uses parts of Jena (e.g. ARQ).
Post by Osma Suominen
There is a wrapper script called hdtsparql.sh for executing it with the
proper Java environment.

Does this tool work nicely with large HDT files such as wikidata? Or does it need to load the whole graph+index into memory?
Osma Suominen
2017-11-02 13:59:50 UTC
Permalink
Post by Osma Suominen
Post by Osma Suominen
The tool is in the hdt-jena package (not hdt-java-cli where the other
command line tools reside), since it uses parts of Jena (e.g. ARQ).
Post by Osma Suominen
There is a wrapper script called hdtsparql.sh for executing it with the
proper Java environment.
Does this tool work nicely with large HDT files such as wikidata? Or does it need to load the whole graph+index into memory?
I haven't tested it with huge datasets like Wikidata. But for the
moderately sized (40M triples) data that I use it for, it runs pretty
fast and without using lots of memory, so I think it just memory maps
the hdt and index file and reads only what it needs to answer the query.

-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi
Laura Morales
2017-11-03 07:48:03 UTC
Permalink
Hello list,

a very kind person from this list has generated the .hdt.index file for me, using the 1-year-old Wikidata HDT file available at the rdfhdt website. So I was finally able to set up a working local endpoint using HDT+Fuseki. Setup was easy, and launch time (for Fuseki) was also quick (a few seconds); the only change I made was to replace -Xmx1024m with -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way). I've run some queries too. Simple select or traversal queries seem fast to me (I haven't measured them, but the response is almost immediate), while other queries such as "select distinct ?class where { [] a ?class }" take several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well for all queries. But otherwise, for simple queries it works perfectly! At least I'm able to query the dataset!
In conclusion, I think this more or less gives some positive feedback on using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but can't set up a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback.
For all of this, I wholeheartedly plead with the Wikidata devs to please consider scheduling a HDT dump (.hdt + .hdt.index) along with the other regular dumps that are created weekly.
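The heap tweak described above can be applied mechanically; here is a small demonstration on a mock copy of the startup script (the real script name and the exact JVM_ARGS line vary by Fuseki version, so this only illustrates the substitution):

```shell
# Demonstrate the -Xmx1024m -> -Xmx4g substitution on a mock startup script.
printf 'JVM_ARGS="${JVM_ARGS:--Xmx1024m}"\n' > fuseki-server.mock
sed -i 's/-Xmx1024m/-Xmx4g/' fuseki-server.mock
grep -c -- '-Xmx4g' fuseki-server.mock   # prints 1
rm fuseki-server.mock
```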

Thank you!!
Osma Suominen
2017-11-03 08:56:39 UTC
Permalink
Hi Laura,

Thank you for sharing your experience! I think your example really shows
the power - and limitations - of HDT technology for querying very large
RDF data sets. While I don't currently have any use case for a local,
queryable Wikidata dump, I can easily see that it could be very useful
for doing e.g. resource-intensive, analytic queries. Having access to a
recent hdt+index dump of Wikidata would make it very easy to start doing
that. So I second your plea.

-Osma
Post by Laura Morales
Hello list,
a very kind person from this list has generated the .hdt.index file for me, using the 1-year old wikidata HDT file available at the rdfhdt website. So I was finally able to setup a working local endpoint using HDT+Fuseki. Set up was easy, launch time (for Fuseki) also was quick (a few seconds), the only change I made was to replace -Xmx1024m to -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way). I've ran some queries too. Simple select or traversal queries seems fast to me (I haven't measured them but the response is almost immediate), other queries such as "select distinct ?class where { [] a ?class }" takes several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well on all queries. But otherwise for simple queries it works perfectly! At least I'm able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but who can't setup a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback.
For all of this, I heartwarmingly plea any wikidata dev to please consider scheduling a HDT dump (.hdt + .hdt.index) along with the other regular dumps that it creates weekly.
Thank you!!
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi
Ettore RIZZA
2017-11-03 09:05:51 UTC
Permalink
Thank you for this feedback, Laura.

Is the hdt index you got available somewhere on the cloud?

Cheers
Post by Ghislain ATEMEZING
Hi Laura,
Thank you for sharing your experience! I think your example really shows
the power - and limitations - of HDT technology for querying very large RDF
data sets. While I don't currently have any use case for a local, queryable
Wikidata dump, I can easily see that it could be very useful for doing e.g.
resource-intensive, analytic queries. Having access to a recent hdt+index
dump of Wikidata would make it very easy to start doing that. So I second
your plea.
-Osma
Post by Laura Morales
Hello list,
a very kind person from this list has generated the .hdt.index file for
me, using the 1-year old wikidata HDT file available at the rdfhdt website.
So I was finally able to setup a working local endpoint using HDT+Fuseki.
Set up was easy, launch time (for Fuseki) also was quick (a few seconds),
the only change I made was to replace -Xmx1024m to -Xmx4g in the Fuseki
startup script (btw I'm not very proficient in Java, so I hope this is the
correct way). I've ran some queries too. Simple select or traversal queries
seems fast to me (I haven't measured them but the response is almost
immediate), other queries such as "select distinct ?class where { [] a
?class }" takes several seconds or a few minutes to complete, which kinda
tells me the HDT indexes don't work well on all queries. But otherwise for
simple queries it works perfectly! At least I'm able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for
using HDT on a "commodity computer", which means it can be very useful for
people like me who want to use the dataset locally but who can't setup a
full-blown server. If others want to try as well, they can offer more
(hopefully positive) feedback.
For all of this, I heartwarmingly plea any wikidata dev to please
consider scheduling a HDT dump (.hdt + .hdt.index) along with the other
regular dumps that it creates weekly.
Thank you!!
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finlan
<https://maps.google.com/?q=y+of+Finlan&entry=gmail&source=g>d
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
http://www.nationallibrary.fi
Jasper Koehorst
2017-11-03 09:15:12 UTC
Permalink
I am uploading the index file temporarily to:

http://fungen.wur.nl/~jasperk/WikiData/

Jasper
Post by Ettore RIZZA
Thank you for this feedback, Laura.
Is the hdt index you got available somewhere on the cloud?
Cheers
Hi Laura,
Thank you for sharing your experience! I think your example really shows the power - and limitations - of HDT technology for querying very large RDF data sets. While I don't currently have any use case for a local, queryable Wikidata dump, I can easily see that it could be very useful for doing e.g. resource-intensive, analytic queries. Having access to a recent hdt+index dump of Wikidata would make it very easy to start doing that. So I second your plea.
-Osma
Hello list,
a very kind person from this list has generated the .hdt.index file for me, using the 1-year-old Wikidata HDT file available at the rdfhdt website. So I was finally able to set up a working local endpoint using HDT+Fuseki. Setup was easy, and launch time (for Fuseki) was also quick (a few seconds); the only change I made was to replace -Xmx1024m with -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way). I've run some queries too. Simple select or traversal queries seem fast to me (I haven't measured them, but the response is almost immediate), while other queries such as "select distinct ?class where { [] a ?class }" take several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well on all queries. But otherwise, for simple queries it works perfectly! At least I'm able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but who can't set up a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback.
For all of this, I wholeheartedly plead with any Wikidata dev to please consider scheduling an HDT dump (.hdt + .hdt.index) along with the other regular dumps that are created weekly.
Thank you!!
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529 <tel:%2B358%2050%203199529>
http://www.nationallibrary.fi <http://www.nationallibrary.fi/>
Ettore RIZZA
2017-11-03 09:16:49 UTC
Permalink
Thank you very much, Jasper !
Post by Jasper Koehorst
http://fungen.wur.nl/~jasperk/WikiData/
Jasper
Thank you for this feedback, Laura.
Is the hdt index you got available somewhere on the cloud?
Cheers
Laura Morales
2017-11-03 09:29:12 UTC
Permalink
Thank you for this feedback, Laura. 
Is the hdt index you got available somewhere on the cloud?
Unfortunately it's not. It was a private link that was temporarily shared with me by email. I guess I could re-upload the file somewhere else myself, but my uplink is really slow (1Mbps).
Lucas Werkmeister
2017-11-03 12:44:38 UTC
Permalink
I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681)
for providing a HDT dump, let’s see if someone else (ideally from the ops
team) responds to it. (I’m not familiar with the systems we currently use
for the dumps, so I can’t say if they have enough resources for this.)

Cheers,
Lucas
Post by Laura Morales
Post by Ettore RIZZA
Thank you for this feedback, Laura.
Is the hdt index you got available somewhere on the cloud?
Unfortunately it's not. It was a private link that was temporarily shared
with me by email. I guess I could re-upload the file somewhere else myself,
but my uplink is really slow (1Mbps).
--
Lucas Werkmeister
Software Developer (Intern)

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de

Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment.

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
Laura Morales
2017-11-03 13:02:21 UTC
Permalink
I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681) for providing a HDT dump, let’s see if someone else (ideally from the ops team) responds to it. (I’m not familiar with the systems we currently use for the dumps, so I can’t say if they have enough resources for this.)
Thank you Lucas!
Jérémie Roquet
2017-11-07 14:24:38 UTC
Permalink
Hi everyone,

I'm afraid the current implementation of HDT is not ready to handle
more than 4 billion triples, as it is limited to 32-bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.
--
Jérémie
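[Editor's note: Jérémie's point can be made concrete with a back-of-the-envelope check. This is only a sketch; the 4.65B figure is the naive estimate given elsewhere in this thread, not an exact count.]

```python
# Sketch of the 32-bit index limit described above: with 32-bit ids,
# an index can address at most 2^32 entries. The Wikidata triple
# count below is the rough estimate from this thread.
MAX_32BIT_IDS = 2 ** 32          # 4,294,967,296

def fits_in_32bit_index(n_triples: int) -> bool:
    """True if n_triples can be addressed with 32-bit indexes."""
    return n_triples <= MAX_32BIT_IDS

print(fits_in_32bit_index(2_000_000_000))   # True: the 2017 rdfhdt dump
print(fits_in_32bit_index(4_650_000_000))   # False: Wikidata needs 64-bit
```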
Ghislain ATEMEZING
2017-11-07 14:32:38 UTC
Permalink
Hi Jeremie,
Thanks for this info.
In the meantime, what about making chunks of 3.5 billion triples (or any size below 4 billion) and a script to convert the dataset? Would that be possible?

Best,
Ghislain

Sent from Mail for Windows 10

From: Jérémie Roquet
Sent: Tuesday, 7 November 2017 15:25
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Wikidata HDT dump

Hi everyone,

I'm afraid the current implementation of HDT is not ready to handle
more than 4 billion triples, as it is limited to 32-bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.
--
Jérémie
Jérémie Roquet
2017-11-07 16:56:24 UTC
Permalink
Post by Ghislain ATEMEZING
In the meantime, what about making chunks of 3.5 billion triples (or any
size below 4 billion) and a script to convert the dataset? Would that be possible?
That seems possible to me, but I wonder whether cutting the dataset into
independent clusters isn't a bigger undertaking than making
HDT handle bigger datasets (I'm not saying it is, I've really no
idea).

Best regards,
--
Jérémie
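[Editor's note: Ghislain's chunking idea could look roughly like the sketch below. The chunk size and the one-triple-per-line N-Triples assumption are illustrative; real chunks would still need their dictionaries reconciled somehow, which is Jérémie's concern.]

```python
# Minimal sketch of the chunking idea: split an N-Triples stream (one
# triple per line) into pieces that each stay below a given triple
# count, so every piece fits a 32-bit-indexed HDT. The chunk size here
# is tiny for illustration; for Wikidata it would be ~3.5 billion.
from itertools import islice

def chunk_ntriples(lines, max_triples):
    """Yield successive lists of at most max_triples lines."""
    it = iter(lines)
    while chunk := list(islice(it, max_triples)):
        yield chunk

triples = [f"<urn:s{i}> <urn:p> <urn:o{i}> ." for i in range(10)]
sizes = [len(c) for c in chunk_ntriples(triples, 4)]
print(sizes)  # [4, 4, 2]
```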
Ettore RIZZA
2018-10-01 07:10:13 UTC
Permalink
Hello,

a new dump of Wikidata in HDT (with index) is available
<http://www.rdfhdt.org/datasets/>. You will see how huge Wikidata has
become compared to other datasets: it contains about twice the limit of 4B
triples discussed above.

In this regard, what is in 2018 the most user friendly way to use this
format?

BR,

Ettore

On Tue, 7 Nov 2017 at 15:33, Ghislain ATEMEZING <
Post by Ghislain ATEMEZING
Hi Jeremie,
Thanks for this info.
In the meantime, what about making chunks of 3.5 billion triples (or any
size below 4 billion) and a script to convert the dataset? Would that be
possible?
Best,
Ghislain
Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986>
for Windows 10
*Sent:* Tuesday, 7 November 2017 15:25
*To: *Discussion list for the Wikidata project.
*Subject:* Re: [Wikidata] Wikidata HDT dump
Hi everyone,
I'm afraid the current implementation of HDT is not ready to handle
more than 4 billion triples, as it is limited to 32-bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135
Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.
--
Jérémie
Laura Morales
2018-10-01 16:59:38 UTC
Permalink
a new dump of Wikidata in HDT (with index) is available [http://www.rdfhdt.org/datasets/].
Thank you very much! Keep it up!
Out of curiosity, what computer did you use for this? IIRC it required >512GB of RAM to function.
You will see how Wikidata has become huge compared to other datasets. it contains about twice the limit of 4B triples discussed above.
There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.
In this regard, what is in 2018 the most user friendly way to use this format?
Speaking for me at least, Fuseki with a HDT store. But I know there are also some CLI tools from the HDT folks.
Ettore RIZZA
2018-10-01 21:03:59 UTC
Permalink
Post by Laura Morales
what computer did you use for this? IIRC it required >512GB of RAM to
function.

Hello Laura,

Sorry for my confusing message, I am not at all a member of the HDT team.
But according to its creator
<https://twitter.com/ciutti/status/1046849607114936320>, 100 GB "with an
optimized code" could be enough to produce an HDT like that.
Post by Laura Morales
a new dump of Wikidata in HDT (with index) is available
[http://www.rdfhdt.org/datasets/].
Thank you very much! Keep it up!
Out of curiosity, what computer did you use for this? IIRC it required
>512GB of RAM to function.
You will see how Wikidata has become huge compared to other datasets. it
contains about twice the limit of 4B triples discussed above.
There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.
In this regard, what is in 2018 the most user friendly way to use this
format?
Speaking for me at least, Fuseki with a HDT store. But I know there are
also some CLI tools from the HDT folks.
Paul Houle
2018-10-01 23:32:03 UTC
Permalink
You shouldn't have to keep anything in RAM to HDT-ize a dataset: you
could build the dictionary by sorting on disk, and likewise do the joins
that look everything up against the dictionary by sorting.

------ Original Message ------
From: "Ettore RIZZA" <***@gmail.com>
To: "Discussion list for the Wikidata project."
<***@lists.wikimedia.org>
Sent: 10/1/2018 5:03:59 PM
Subject: Re: [Wikidata] Wikidata HDT dump
Post by Laura Morales
Post by Laura Morales
what computer did you use for this? IIRC it required >512GB of RAM to
function.
Hello Laura,
Sorry for my confusing message, I am not at all a member of the HDT
team. But according to its creator
<https://twitter.com/ciutti/status/1046849607114936320>, 100 GB "with
an optimized code" could be enough to produce an HDT like that.
Post by Laura Morales
Post by Ettore RIZZA
a new dump of Wikidata in HDT (with index) is
available[http://www.rdfhdt.org/datasets/].
Thank you very much! Keep it up!
Out of curiosity, what computer did you use for this? IIRC it required
>512GB of RAM to function.
Post by Ettore RIZZA
You will see how Wikidata has become huge compared to other
datasets. it contains about twice the limit of 4B triples discussed
above.
There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.
Post by Ettore RIZZA
In this regard, what is in 2018 the most user friendly way to use
this format?
Speaking for me at least, Fuseki with a HDT store. But I know there
are also some CLI tools from the HDT folks.
Laura Morales
2018-10-02 23:09:59 UTC
Permalink
You shouldn't have to keep anything in RAM to HDT-ize a dataset: you could build the dictionary by sorting on disk, and likewise do the joins that look everything up against the dictionary by sorting.
Yes, but somebody has to write the code for it :)
My understanding is that they keep everything in memory because it was simpler to develop. The problem is that graphs can become really huge, so this approach clearly doesn't scale well.
Laura Morales
2018-10-02 23:05:19 UTC
Permalink
100 GB "with an optimized code" could be enough to produce an HDT like that.
The current software definitely cannot handle Wikidata with 100GB; it was tried before and it failed.
I'm glad to see that new code will be released to handle large files. After skimming that paper it looks like they split the RDF source into multiple files and "cat" them into a single HDT file. 100GB is still a pretty large footprint, but I'm so glad that they're working on this. A 128GB server is *way* more affordable than one with 512GB or 1TB!

I can't wait to try the new code myself.
Laura Morales
2017-11-07 16:09:27 UTC
Permalink
How many triples does Wikidata have? The old dump from rdfhdt seems to have about 2 billion, which would mean Wikidata doubled its number of triples in less than a year?
 
 

Sent: Tuesday, November 07, 2017 at 3:24 PM
From: "Jérémie Roquet" <***@arkanosis.net>
To: "Discussion list for the Wikidata project." <***@lists.wikimedia.org>
Subject: Re: [Wikidata] Wikidata HDT dump
Hi everyone,

I'm afraid the current implementation of HDT is not ready to handle
more than 4 billion triples, as it is limited to 32-bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135

Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.

--
Jérémie

Lucas Werkmeister
2017-11-07 16:21:50 UTC
Permalink
The Wikidata Query Service currently holds some 3.8 billion triples –
you can see the numbers on Grafana [1]. But WDQS “munges” the dump
before importing it – for instance, it merges wdata:… into wd:… and
drops `a wikibase:Item` and `a wikibase:Statement` types; see [2] for
details – so the triple count in the un-munged dump will be somewhat
larger than the triple count in WDQS.

Cheers,
Lucas

[1]:
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?panelId=7&fullscreen
[2]:
https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences
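[Editor's note: the "munging" Lucas describes can be sketched roughly as below. The real rule set lives at [2]; the prefix IRIs here are the usual Wikidata ones, assumed for illustration rather than quoted from the munger's source.]

```python
# Illustrative sketch of the WDQS munging described above: fold wdata:
# subjects into wd: and drop the generic `a wikibase:Item` /
# `a wikibase:Statement` typing triples. Prefix IRIs are assumptions.
WDATA = "http://www.wikidata.org/wiki/Special:EntityData/"
WD = "http://www.wikidata.org/entity/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DROPPED = {"http://wikiba.se/ontology#Item",
           "http://wikiba.se/ontology#Statement"}

def munge(triples):
    for s, p, o in triples:
        if p == RDF_TYPE and o in DROPPED:
            continue                        # typing triple: dropped
        if s.startswith(WDATA):
            s = WD + s[len(WDATA):]         # wdata:Q42 -> wd:Q42
        yield (s, p, o)

sample = [
    (WD + "Q42", RDF_TYPE, "http://wikiba.se/ontology#Item"),
    (WD + "Q42", "http://www.wikidata.org/prop/direct/P31", WD + "Q5"),
]
print(len(list(munge(sample))))  # 1: only the P31 triple survives
```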
Post by Laura Morales
How many triples does Wikidata have? The old dump from rdfhdt seems to have about 2 billion, which would mean Wikidata doubled its number of triples in less than a year?
 
 
Sent: Tuesday, November 07, 2017 at 3:24 PM
Subject: Re: [Wikidata] Wikidata HDT dump
Hi everyone,
I'm afraid the current implementation of HDT is not ready to handle
more than 4 billion triples, as it is limited to 32-bit indexes. I've
opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135
Until this is addressed, don't waste your time trying to convert the
entire Wikidata to HDT: it can't work.
--
Jérémie
Laura Morales
2017-11-07 17:31:33 UTC
Permalink
Post by Lucas Werkmeister
drops `a wikibase:Item` and `a wikibase:Statement` types
off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
Wouter Beek
2017-12-12 10:24:05 UTC
Permalink
Hi list,

I'm sorry, I was under the impression that I had already shared this
resource with you earlier, but I haven't...

On 7 Nov I created an HDT file based on the then current download link
from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

You can download this HDT file and its index from the following locations:
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt (~45GB)
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt.index.v1-1 (~28GB)

You may need to compile with 64-bit support, because there are more
than 2B triples (https://github.com/rdfhdt/hdt-cpp/tree/develop-64).
(To be exact, there are 4,579,973,187 triples in this file.)

PS: If this resource turns out to be useful to the community we can
offer an updated HDT file at a to be determined interval.

---
Cheers,
Wouter Beek.

Email: ***@triply.cc
WWW: http://triply.cc
Tel: +31647674624
Post by Laura Morales
Post by Lucas Werkmeister
drops `a wikibase:Item` and `a wikibase:Statement` types
off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
Laura Morales
2017-12-12 11:58:09 UTC
Permalink
* T H A N K Y O U *
Post by Wouter Beek
On 7 Nov I created an HDT file based on the then current download link
from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
Thank you very very much Wouter!! This is great!
Out of curiosity, could you please share some info about the machine that you used to generate these files? In particular, hardware info such as the model names of the motherboard/CPU/RAM/disks. How long it took to generate these files would also be interesting to know.
Post by Wouter Beek
PS: If this resource turns out to be useful to the community we can
offer an updated HDT file at a to be determined interval.
This would be fantastic! Wikidata dumps about once a week, so I think even a new HDT file every 1-2 months would be awesome.
Related to this however... why not use the Laundromat for this? There are several datasets that are very large, and rdf2hdt is really expensive to run. Maybe you could schedule regular jobs for several graphs (wikidata, dbpedia, wordnet, linkedgeodata, government data, ...) and make them available at the Laundromat?

* T H A N K Y O U *
Wouter Beek
2017-12-15 14:52:44 UTC
Permalink
Hi Wikidata community,

Somebody pointed me to the following issue:
https://phabricator.wikimedia.org/T179681. Unfortunately, I'm not able
to log in to Phabricator there, so I cannot edit the issue
directly. I'm sending this email instead.

The issue seems to be stalled because it is not possible to create HDT
files that contain more than 2B triples. However, this is possible in
a specific 64 bit branch, which is how I created the downloadable
version I've sent a few days ago. As indicated, I can create these
files for the community if there is a use case.

---
Cheers,
Wouter.

Email: ***@triply.cc
WWW: http://triply.cc
Tel: +31647674624
Post by Wouter Beek
Hi list,
I'm sorry, I was under the impression that I had already shared this
resource with you earlier, but I haven't...
On 7 Nov I created an HDT file based on the then current download link
from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt (~45GB)
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt.index.v1-1 (~28GB)
You may need to compile with 64bit support, because there are more
than 2B triples (https://github.com/rdfhdt/hdt-cpp/tree/develop-64).
(To be exact, there are 4,579,973,187 triples in this file.)
PS: If this resource turns out to be useful to the community we can
offer an updated HDT file at a to be determined interval.
---
Cheers,
Wouter Beek.
WWW: http://triply.cc
Tel: +31647674624
Post by Laura Morales
Post by Lucas Werkmeister
drops `a wikibase:Item` and `a wikibase:Statement` types
off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
Stas Malyshev
2017-12-16 00:44:36 UTC
Permalink
Hi!
Post by Wouter Beek
https://phabricator.wikimedia.org/T179681 Unfortunately I'm not able
to log in there with the "Phabricator" so I cannot edit the issue
directly. I'm sending this email instead.
Thank you, I've updated the task with references to your comments.
--
Stas Malyshev
***@wikimedia.org
Jérémie Roquet
2017-11-07 16:48:59 UTC
Permalink
Post by Laura Morales
How many triples does Wikidata have? The old dump from rdfhdt seems to have about 2 billion, which would mean Wikidata doubled its number of triples in less than a year?
A naive grep | wc -l on the last Turtle dump gives me an estimate of
4.65 billion triples.

Looking at https://tools.wmflabs.org/wikidata-todo/stats.php it seems
that Wikidata is indeed more than twice as big as only six months ago.
--
Jérémie
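[Editor's note: the naive estimate can be reproduced with a few lines like the sketch below. It over- and under-counts slightly (e.g. directives and multi-line literals), which is why it is only an estimate.]

```python
# Rough triple count in the spirit of the naive grep | wc -l estimate:
# in Turtle, each '.', ';' or ',' line terminator closes one triple
# (predicate/object lists), while @prefix/@base directives also end
# with '.' and must be excluded. Multi-line literals still throw this off.
import io

def estimate_triples(turtle_lines):
    count = 0
    for line in turtle_lines:
        stripped = line.strip()
        if stripped.startswith("@prefix") or stripped.startswith("@base"):
            continue
        if stripped.endswith((".", ";", ",")):
            count += 1
    return count

sample = io.StringIO(
    "@prefix ex: <http://example.org/> .\n"
    "ex:a ex:p ex:b ;\n"
    "     ex:q ex:c ,\n"
    "          ex:d .\n"
)
print(estimate_triples(sample))  # 3
```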
Laura Morales
2017-10-27 15:35:26 UTC
Permalink
Post by Wouter Beek
Dear Laura, others,
If somebody points me to the RDF datadump of Wikidata I can deliver an
HDT version for it, no problem. (Given the current cost of memory I
do not believe that the memory consumption for HDT creation is a
blocker.)
This would be awesome! Thanks, Wouter. To the best of my knowledge, the most up-to-date dump is this one [1]. Let me know if you need any help with anything. Thank you again!

[1] https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz

---
Cheers,
Wouter Beek.

Email: ***@triply.cc
WWW: http://triply.cc
Tel: +31647674624
Post by Wouter Beek
Hello everyone,
I'd like to ask if Wikidata could please offer a HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format to store RDF data, which is pretty useful because it can be queried from command line, it can be used as a Jena/Fuseki source, and it also uses orders-of-magnitude less space to store the same data. The problem is that it's very impractical to generate a HDT, because the current implementation requires a lot of RAM processing to convert a file. For Wikidata it will probably require a machine with 100-200GB of RAM. This is unfeasible for me because I don't have such a machine, but if you guys have one to share, I can help setup the rdf2hdt software required to convert Wikidata Turtle to HDT.
Thank you.
[1] http://www.rdfhdt.org/
[2] https://dumps.wikimedia.org/wikidatawiki/entities/