Discussion:
Kickstartet: Adding 2.2 million German organisations to Wikidata
Sebastian Hellmann
2017-10-15 07:44:45 UTC
Hi all,

The German business registry contains roughly 2.2 million organisations.
Some of the information is paid, but the rest is public, i.e. the info you
get by searching at the link below and clicking on UT (see the example
further down):

https://www.handelsregister.de/rp_web/mask.do?Typ=e


I would like to add this to Wikidata, either by crawling or by raising
money to use crowdsourcing platforms like CrowdFlower or Amazon Mechanical
Turk.


It should meet notability criterion 2:
https://www.wikidata.org/wiki/Wikidata:Notability
2. It refers to an instance of a *clearly identifiable conceptual or
material entity*. The entity must be notable, in the sense that it
*can be described using serious and publicly available references*. If
there is no item about you yet, you are probably not notable.
The reference is the official German business registry, which is serious
and public. Organisations are also, by definition, clearly identifiable
legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian



Entity data


Saxony District court *Leipzig HRB 32853 *– A&A
Dienstleistungsgesellschaft mbH
Legal status: Gesellschaft mit beschränkter Haftung
Capital: 25.000,00 EUR
Date of entry: 29/08/2016
(When entering date of entry, wrong data input can occur due to system
failures!)
Date of removal: -
Balance sheet available: -
Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
Prager Straße 38-40
04317 Leipzig
--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Yaroslav Blanter
2017-10-15 20:10:25 UTC
Hi Sebastian,

I would say the best way is to file a request for bot permissions:

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot

and possibly leave a message on the Project Chat

https://www.wikidata.org/wiki/Wikidata:Project_chat

Cheers
Yaroslav

_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
Sebastian Hellmann
2017-10-16 07:48:08 UTC
Thanks, done.

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
Andra Waagmeester
2017-10-16 08:25:12 UTC
A dataset of similar size is available on Belgian enterprises. With the
same objective of enriching Wikidata with enterprise data, I recently proposed
the following property:
https://www.wikidata.org/wiki/Wikidata:Property_proposal/NACE_code

However, after some talks with others in the Wikidata community, I have
recently had second thoughts on whether a full dump of this type of
dataset is a valuable enrichment of Wikidata. Adding 2 million items with
additional statements per item would be quite an enlargement of Wikidata. If
we bot-added all businesses of both Belgium and Germany, we would have 4
million new items, which would currently amount to 10% of all of
Wikidata. I am not sure what this would mean in terms of scalability, or
whether it would cause any scalability issues.

Maybe a use-case-driven approach would be more appropriate here. We could
think of a bot that draws on the trade registers of the different
countries whenever a specific use case justifies the inclusion of trade
data.

Just my 2cts

Cheers,

Andra

Federico Morando
2017-10-16 10:47:28 UTC
Dear All,

Although in Italy these data are normally not available (not even the basic
data) from the chambers of commerce, there are some open data from which we
could extract several identifiers. Of course these are biased toward the
suppliers of public administrations, because contracting with a PA is the
trigger for being listed in these open data.

In the context of a broader effort to upload this kind of data to Wikidata,
such as the one that seems to be emerging from this thread, the firm I manage
may be willing to contribute about half a million pairs of labels and VAT
IDs. It's a relatively thin dataset, in the sense that you just have the
name of the firm and the VAT ID, and possibly a link to a portal we're
building where you can find additional information about the firm's activity
with the Italian public administration. But, as I mentioned, Italian firm
data are quite rare (they are not even available on OpenCorporates.com).

By the way, https://www.wikidata.org/wiki/Property:P3608 (EU VAT number)
already exists and may provide a sufficient identifier in most cases, since
the country ISO code (e.g. IT for Italy) plus the national VAT ID usually
yields the EU VAT number (the actual algorithm may be a bit more
complex, but it's documented). (That said, there are also national
identifiers for which properties may be worth creating, such as the
registration number at national chambers of commerce, etc.)
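
As a rough illustration of the "country code + national VAT ID" rule described above, here is a minimal Python sketch. The function name and the sample VAT ID are made up for illustration, and the sketch deliberately ignores the country-specific exceptions hinted at above (for example, Greek VAT numbers use the prefix EL rather than the ISO code GR), so treat it as an assumption-laden starting point rather than the documented algorithm.

```python
def eu_vat_number(country_code: str, national_vat_id: str) -> str:
    """Prefix a national VAT ID with a two-letter country code.

    Simplified sketch: no check digits are validated and no
    country-specific prefixes (such as EL for Greece) are handled.
    """
    code = country_code.strip().upper()
    if len(code) != 2 or not code.isalpha():
        raise ValueError("expected a two-letter country code, got %r" % country_code)
    # Drop spaces, dots and dashes that people often type inside VAT IDs.
    cleaned = "".join(ch for ch in national_vat_id if ch.isalnum()).upper()
    if not cleaned:
        raise ValueError("empty national VAT ID")
    return code + cleaned


# Hypothetical Italian VAT ID, used only to show the call.
print(eu_vat_number("it", "0123 4567 890"))
```

A value built this way would still need to be checked against the official format rules before being stored as a P3608 statement.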

About the value of these data on Wikidata, starting from our use case, I
think that having permanent URIs for all firms on Wikidata would provide,
for instance, great value for several anti-corruption projects around the
world. (This could also provide a place to trace some international links
among companies, which are not always readily available today.) That said,
I perfectly understand the concerns of Andra in terms of scalability and
maintenance, and this is one of the reasons I did not think of donating
these data to Wikidata so far.

I'll try to follow these discussions, but please - Sebastian or others -
feel free to ping me if the project goes on and you want to include these
Italian data.

Best,

Federico
Federico Leva (Nemo)
2017-10-15 20:15:41 UTC
This is an area where I would very much like to see some important
properties created and populated, to the benefit e.g. of various
infoboxes on Wikipedias which contain data in need of frequent updates
(especially income, revenue, market capitalization, number of employees,
links to most recent financial statements and other corporate information).
https://www.wikidata.org/w/index.php?title=Wikidata:Property_proposal/Organization&oldid=307430401
https://www.wikidata.org/wiki/Wikidata:List_of_properties/Organization

Even data for companies listed on stock exchanges is terribly outdated
most of the time.

Nemo
Neubert, Joachim
2017-10-16 08:18:12 UTC
Hi Sebastian,

This is huge! It will cover almost all currently existing German companies. Many of these will have similar names, so preparing for disambiguation is a concern.

A good way to approach this would be to propose a property for an external identifier, load the data into Mix-n-match, create links for companies already in Wikidata, and add the rest (or perhaps only part of them - I’m not sure if having all of them in Wikidata makes sense, but that’s another discussion), preferably with location and/or sector of trade in the description field.

I’ve tried to figure out what could be used as the key for an external identifier property. However, it looks like the registry does not offer any (persistent) URL to its entries. So for looking up a company, apparently there are two options:

- conducting an extended search for the exact string “A&A Dienstleistungsgesellschaft mbH”

- copying the register number “32853”, selecting the court (Leipzig) from the corresponding dropdown list, and searching for that

Neither way is very intuitive, even if we can provide a link to the search form. This would make for a weak connection to the source of information. More importantly, it makes disambiguation in Mix-n-match difficult. This applies to the preparation of your initial load (you would not want to create duplicates), but even more so to everybody else who wants to match his or her data later on. Being forced to search for entries manually in a cumbersome way in order to disambiguate a new, possibly large and rich dataset is, in my eyes, not something we want to impose on future contributors. And often, the free information they find in the registry (formal name, register number, legal form, address) will not easily match the information they have (common name, location, perhaps founding date, and most importantly sector of trade), so disambiguation may still be difficult.
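
To make the mismatch between formal and common names a little more concrete, here is a small Python sketch of a comparison key that ignores case, punctuation and a few legal-form tokens. The token list and the function name are assumptions chosen for illustration; they are neither complete nor anything Mix-n-match itself provides, and a key like this can only suggest candidates that still need manual review.

```python
import re

# Assumed, deliberately incomplete set of legal-form tokens to ignore.
LEGAL_FORM_TOKENS = {"gmbh", "mbh", "ag", "kg", "ohg", "ug"}


def match_key(name: str) -> str:
    """Reduce a company name to a rough key for candidate matching."""
    # casefold() lowercases and also maps "ß" to "ss".
    cleaned = re.sub(r"[^\w&]+", " ", name.casefold())
    tokens = [tok for tok in cleaned.split() if tok not in LEGAL_FORM_TOKENS]
    return " ".join(tokens)


formal = "A&A Dienstleistungsgesellschaft mbH"
common = "a&a  dienstleistungsgesellschaft"
print(match_key(formal) == match_key(common))
```

Two names that share a key are only candidates for the same company; location or register number would still be needed to confirm a match.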

Have you checked which parts of the accessible information (as shown below) can legally be crawled and added to external databases such as Wikidata?

Cheers, Joachim
--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462



Sebastian Hellmann
2017-10-16 11:41:50 UTC
Hi all,

The technical challenges are not so difficult.

- 2.2 million is the exact number of German organisations, i.e.
associations and companies. Each entry is also unique.

- Wikidata has 40k organisations:

https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A
%3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}

so there would be at most 40k duplicates. These are easy to find and
deduplicate.

- The crawl can be done easily; a colleague has done so before.
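
For readers who find the percent-encoded link above hard to read, the following Python sketch spells out the SPARQL text it appears to contain and shows how such a link can be rebuilt. The COUNT variant and the helper name are additions for illustration, not part of the original link. Note that the query only matches items that are direct instances (P31) of organization (Q43229), so the resulting figure is a rough indication rather than a full count of organisations in Wikidata.

```python
from urllib.parse import quote

# SPARQL text decoded from the link above.
QUERY = """SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q43229.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

# Assumed variant that returns a single number instead of every item.
COUNT_QUERY = """SELECT (COUNT(?item) AS ?count)
WHERE { ?item wdt:P31 wd:Q43229. }"""


def query_service_link(sparql: str) -> str:
    """Build a query.wikidata.org link with the SPARQL text in the URL fragment."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


print(query_service_link(COUNT_QUERY))
```

Opening the printed link in a browser loads the query into the Wikidata Query Service editor, where it can be run by hand.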


The open questions are:

- Do you want this data uploaded to Wikidata? It would be a really big
extension. Can I go ahead?

- If the data were available externally as structured data under an open
license, I would probably not suggest loading it into Wikidata, as the
data could be retrieved from the official source directly. Here, however,
the data will not be published in a decent format.

I thought that the way data is copied from copyrighted sources, i.e. only
facts, is OK for Wikidata. This is done in a lot of places, I guess. The
same goes for Wikipedia, where news articles and copyrighted books are
referenced. So Wikimedia or the Wikimedia community are the experts on this.

All the best,

Sebastian
Sebastian Hellmann
2017-10-16 12:24:51 UTC
Ah yes, forgot to mention:

There is no URI or unique identifier given by the Handelsregister
system. However, the courts take care that the registrations are unique,
so the identifier is implicit. Handelsregister could easily create stable
URIs out of court+type+number, like /Leipzig_HRB_32853.

For Wikidata this is not a problem to handle, so there are no technical
issues on this side either.
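
The court+type+number idea can be sketched in a few lines of Python. The function names and the underscore convention for multi-word court names are assumptions made for this illustration; the registry itself offers no such key.

```python
import re
from typing import Tuple, Union


def register_key(court: str, register_type: str, number: Union[int, str]) -> str:
    """Join court, register type and number into one key, e.g. Leipzig_HRB_32853."""
    parts = [court.strip(), register_type.strip().upper(), str(number).strip()]
    if not all(parts):
        raise ValueError("court, register type and number are all required")
    # Replace inner whitespace so that multi-word court names stay in one key.
    return "_".join(re.sub(r"\s+", "_", part) for part in parts)


def split_register_key(key: str) -> Tuple[str, str, str]:
    """Invert register_key(); the last two underscore-separated parts are type and number."""
    court, register_type, number = key.rsplit("_", 2)
    return court.replace("_", " "), register_type, number


print(register_key("Leipzig", "HRB", 32853))
```

Splitting from the right keeps the key reversible even when the court name itself contains spaces.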

All the best,

Sebastian
Post by Sebastian Hellmann
Hi all,
the technical challenges are not so difficult.
- 2.2 million are the exact number of German organisations, i.e.
associations and companies. They are also unique.
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
so there would be a maximum of 40k duplicates These are easy to find
and deduplicate
- The crawl can be done easily, a colleague has done so before.
- Do you want to upload the data in Wikidata? It would be a real big
extension. Can I go ahead
- If the data were available externally as structured data under open
license, I would probably not suggest loading it into wikidata, as the
data can be retrieved from the official source directly, however, here
this data will not be published in a decent format.
I thought that the way data is copied from coyrighted sources, i.e.
only facts is ok for wikidata. This done in a lot of places, I guess.
Same for Wikipedia, i.e. News articles and copyrighted books are
referenced. So Wikimedia or the Wikimedia community are experts on this.
All the best,
Sebastian
Post by Yaroslav Blanter
Hi Sebastian,
This is huge! It will cover almost all currently existing German
companies. Many of these will have similar names, so preparing for
disambiguation is a concern.
A good way for such an approach would be proposing a property for an
external identifier, loading the data into Mix-n-match, creating
links for companies already in Wikidata, and adding the rest (or
perhaps only parts of them - I’m not sure if having all of them in
Wikidata makes sense, but that’s another discussion), preferably with
location and/or sector of trade in the description field.
I’ve tried to figure out what could be used as key for a external
identifier property. However, it looks like the registry does not
offer any (persistent) URL to its entries. So for looking up a
-conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH“
-copying the register number “32853” plus selecting the court
(Leipzig) from the according dropdown list and search that
Both ways are not very intuitive, even if we can provide a link to
the search form. This would make a weak connection to the source of
information. Much more important, it makes disambiguation in
Mix-n-match difficult. This applies for the preparation of your
initial load (you would not want to create duplicates). But much more
so for everybody else who wants to match his or her data later on.
Being forced to search for entries manually in a cumbersome way for
disambiguation of a new, possibly large and rich dataset is, in my
eyes, not something we want to impose on future contributors. And
often, the free information they find in the registry (formal name,
register number, legal form, address) will not easily match with the
information they have (common name, location, perhaps founding date,
and most important sector of trade), so disambiguation may still be
difficult.
Have you checked which parts of the accessible information as below
can be crawled and added legally to external databases such as Wikidata?
Cheers, Joachim
h***@informatik.uni-leipzig.de
2017-10-16 12:39:51 UTC
Permalink
The best way to avoid creating duplicates, then, is to look at all existing organizations in Wikidata, manually add the court and register number to those that are German, and then exclude these from the import.

That guarantees that there will be no duplicates.
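
The exclusion step described above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the `RegistryEntry` type, the `registry_key` and `entries_to_import` helpers, and the sample data are hypothetical, and the key format (court, register type and number joined by underscores) is an assumption modelled on the "Leipzig HRB 32853" example from the first post.

```python
from typing import Iterable, List, NamedTuple


class RegistryEntry(NamedTuple):
    """One crawled registry record (hypothetical structure)."""
    court: str          # e.g. "Leipzig"
    register_type: str  # e.g. "HRB"
    number: str         # e.g. "32853"
    name: str


def registry_key(court: str, register_type: str, number: str) -> str:
    """Join court, register type and number into one key (assumed format)."""
    parts = (court, register_type.upper(), number)
    # Collapse inner whitespace so multi-word court names also give one token.
    return "_".join("_".join(part.split()) for part in parts)


def entries_to_import(crawled: Iterable[RegistryEntry],
                      known_keys: Iterable[str]) -> List[RegistryEntry]:
    """Keep only the entries whose key is not already attached to an item."""
    known = set(known_keys)
    return [entry for entry in crawled
            if registry_key(entry.court, entry.register_type, entry.number)
            not in known]


crawled = [
    RegistryEntry("Leipzig", "HRB", "32853",
                  "A&A Dienstleistungsgesellschaft mbH"),
    # Made-up entry, for illustration only.
    RegistryEntry("Leipzig", "VR", "1", "Made-up example association"),
]
# Keys that were added by hand to existing items (assumed to be complete).
known_keys = {"Leipzig_HRB_32853"}

remaining = entries_to_import(crawled, known_keys)
print([entry.name for entry in remaining])
```

The filter is only as good as the set of keys already recorded on existing items, so it depends on the manual step having been carried out completely.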

So the technical side is feasible.
Barriers are political and legal.

Sebastian
Ettore RIZZA
2017-10-16 12:46:15 UTC
Permalink
Post by Sebastian Hellmann
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,

I think Wikidata contains many more organizations than that. If we choose
"instance of business enterprise" (Q4830453), we get 135570 results. And I imagine
there are many other classes that group commercial companies.


https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
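
For readers who do not want to decode the link by eye, a minimal Python sketch that turns the encoded fragment back into the SPARQL text. The variable names are illustrative only; the encoded string is copied from the link above.

```python
from urllib.parse import unquote

# Fragment of the query-service link above (everything after the "#").
ENCODED = (
    "SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A"
    "%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A"
    "%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20"
    "wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D"
)

# Percent-decoding gives back the SPARQL query as typed in the editor.
query = unquote(ENCODED)
print(query)
```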

On the substance, the project to add all companies of a country would make
Wikidata a kind of totally free clone of Open Corporates
<https://opencorporates.com/>. I would of course be delighted to see that,
but would it not be a challenge to maintain such a database? Companies are like
humans: they appear and disappear every day.



Antonin Delpeuch (lists)
2017-10-16 13:16:14 UTC
Permalink
Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.
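
As a rough illustration, the pattern above can be wrapped into a complete query and into a link for the query service. This is a sketch in Python with hypothetical helper names (`instances_query`, `query_service_link`); it only builds strings and does not contact the endpoint, and the link form (encoded query after the `#`) is taken from the links posted earlier in this thread.

```python
from urllib.parse import quote


def instances_query(class_qid: str) -> str:
    """Select items whose type is the class or any of its subclasses."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


def query_service_link(sparql: str) -> str:
    """Put the percent-encoded query after the '#' of the query service URL."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


sparql = instances_query("Q43229")  # organization
print(sparql)
print(query_service_link(sparql))
```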

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said, it is
important to think about how the data will be maintained in the future…

Antonin
Antonin Delpeuch (lists)
2017-10-16 13:34:57 UTC
Permalink
And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
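
To see why a missing DISTINCT inflates the count, here is a toy model in Python. It is a simplification built on assumptions: the class names and items are invented, and it assumes one result row per instance-of statement whose class reaches the target. How a real endpoint produces repeated rows may differ in detail.

```python
# Toy data with invented identifiers (not real Wikidata items).
SUBCLASS_OF = {
    "company": {"organization"},
    "nonprofit": {"organization"},
    "organization": set(),
    "city": set(),
}
INSTANCE_OF = {
    "item_a": ["company"],
    "item_b": ["company", "nonprofit"],  # two classes under "organization"
    "item_c": ["city"],
}


def reaches(cls, target):
    """True if `target` is `cls` itself or one of its transitive superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SUBCLASS_OF.get(current, ()))
    return False


def matching_rows(target):
    """One row per (item, class) pair whose class reaches `target`."""
    return [item
            for item, classes in INSTANCE_OF.items()
            for cls in classes
            if reaches(cls, target)]


rows = matching_rows("organization")
print(len(rows))       # row count: item_b is counted twice
print(len(set(rows)))  # distinct items: each item is counted once
```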

It's easy to get these things wrong!

Antonin
Post by Antonin Delpeuch (lists)
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…
So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…
Antonin
- Wikidata has 40k organisations: 
https://query.wikidata.org/#SELECT
<https://query.wikidata.org/#SELECT> %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi, 
I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results. And
I imagine there are many other categories that bring together commercial
companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would
make Wikidata a kind of totally free clone of Open Corporates
<https://opencorporates.com/>. I would of course be delighted to see
that, but is it not a challenge to maintain such a database? Companies
are like humans, it appears and disappears every day.
 
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
Hi all,
the technical challenges are not so difficult.
- 2.2 million are the exact number of German organisations, i.e.
associations and companies. They are also unique.
https://query.wikidata.org/#SELECT
<https://query.wikidata.org/#SELECT> %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
so there would be a maximum of 40k duplicates These are easy to find
and deduplicate
- The crawl can be done easily, a colleague has done so before.  
- Do you want to upload the data in Wikidata? It would be a real big
extension. Can I go ahead
- If the data were available externally as structured data under
open license, I would probably not suggest loading it into wikidata,
as the data can be retrieved from the official source directly,
however, here this data will not be published in a decent format.
I thought that the way data is copied from coyrighted sources, i.e.
only facts is ok for wikidata. This done in a lot of places, I
guess. Same for Wikipedia, i.e. News articles and copyrighted books
are referenced. So Wikimedia or the Wikimedia community are experts
on this.
All the best,
Sebastian
Hi Sebastian,

This is huge! It will cover almost all currently existing German
companies. Many of these will have similar names, so preparing for
disambiguation is a concern.

A good way for such an approach would be proposing a property for
an external identifier, loading the data into Mix-n-match,
creating links for companies already in Wikidata, and adding the
rest (or perhaps only parts of them - I’m not sure if having all
of them in Wikidata makes sense, but that’s another discussion),
preferably with location and/or sector of trade in the description
field.

I’ve tried to figure out what could be used as key for an external
identifier property. However, it looks like the registry does not
offer any (persistent) URL to its entries. So for looking up a
company, apparently there are two options:

- conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”
- copying the register number “32853”, selecting the court (Leipzig)
from the corresponding dropdown list, and searching for that

Neither way is very intuitive, even if we can provide a link to
the search form. This would make a weak connection to the source
of information. Much more important, it makes disambiguation in
Mix-n-match difficult. This applies to the preparation of your
initial load (you would not want to create duplicates), but much
more so to everybody else who wants to match his or her data
later on. Being forced to search for entries manually in a
cumbersome way for disambiguation of a new, possibly large and
rich dataset is, in my eyes, not something we want to impose on
future contributors. And often, the free information they find in
the registry (formal name, register number, legal form, address)
will not easily match the information they have (common name,
location, perhaps founding date, and most important sector of
trade), so disambiguation may still be difficult.
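
To make the lookup key discussed above concrete: in the sample entry the register number only becomes meaningful together with the court ("Leipzig HRB 32853"). The following is a minimal sketch of such a composite key, not a proposal for the actual property format. The names `RegisterKey`, `parse_register_key` and `format_register_key` are hypothetical, and the list of register types other than HRB is an assumption that does not come from this thread.

```python
import re
from typing import NamedTuple, Optional


class RegisterKey(NamedTuple):
    court: str          # registering court, e.g. "Leipzig"
    register_type: str  # e.g. "HRB"
    number: str         # kept as a string so leading zeros are not lost


# Register types other than HRB are an assumption, not taken from the thread.
_KEY_RE = re.compile(
    r"^\s*(?P<court>.+?)\s+(?P<type>HRA|HRB|GnR|PR|VR)\s+(?P<number>\d+)\s*$"
)


def parse_register_key(text: str) -> Optional[RegisterKey]:
    """Split 'Leipzig HRB 32853' into court, register type and number."""
    m = _KEY_RE.match(text)
    if m is None:
        return None  # e.g. a bare number without a court is not a usable key
    return RegisterKey(m.group("court"), m.group("type"), m.group("number"))


def format_register_key(key: RegisterKey) -> str:
    """Inverse of parse_register_key: one string usable as an identifier."""
    return f"{key.court} {key.register_type} {key.number}"


key = parse_register_key("Leipzig HRB 32853")
print(key)
```

A bare number such as "32853" is rejected on purpose, since the thread suggests the number has to be combined with the court to find an entry.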

Have you checked which parts of the accessible information as
below can be crawled and added legally to external databases such
as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462
Ettore RIZZA
2017-10-16 14:08:09 UTC
Permalink
@Antonin: Thanks for this counting method, it seems very effective (I
already knew that there were about 3.6 million humans (Q5) in Wikidata).

https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D
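
For readers who cannot open the link: the part after `#` is simply the percent-encoded query text, so it can be decoded locally. This is a small sketch using only Python's standard library; the variable names are my own.

```python
from urllib.parse import unquote

# Fragment (the part after '#') of the query-service link above.
fragment = (
    "%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant"
    "%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0"
    "%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem"
    "%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt"
    "%3AP279%2a%29%20wd%3AQ5.%20%7D"
)

# Percent-decoding yields two comment lines followed by the SPARQL query.
query = unquote(fragment)
print(query)
```

The two leading comment lines are in French and still mention the organisation class, while the query itself counts items whose type is human (Q5) or one of its subclasses.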

2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists):
And my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).
So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
Post by Antonin Delpeuch (lists)
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)".
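
The two counting pitfalls mentioned here can be reproduced on a toy graph without touching the query service. The sketch below is an illustration only: the class and item names are made up, and it models just one way repeats can arise (an item with several P31 values that all lead to the target class), which may not be the only cause in the real data.

```python
from typing import Dict, List, Set

# Toy data, entirely made up for illustration.
SUBCLASS_OF: Dict[str, List[str]] = {   # class -> direct superclasses (P279)
    "bank": ["business"],
    "business": ["organization"],
    "public company": ["business"],
}
INSTANCE_OF: Dict[str, List[str]] = {   # item -> direct classes (P31)
    "item_a": ["bank", "public company"],
    "item_b": ["business"],
    "item_c": ["city"],
}


def ancestors(cls: str) -> Set[str]:
    """The class itself plus everything reachable via P279 (like P279*)."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, []))
    return seen


def rows_for(target: str) -> List[str]:
    """One row per (item, matching class) pair, as a query without DISTINCT."""
    return [
        item
        for item, classes in INSTANCE_OF.items()
        for cls in classes
        if target in ancestors(cls)
    ]


# Items typed directly as the target class only (plain wdt:P31).
direct = [item for item, classes in INSTANCE_OF.items() if "organization" in classes]
rows = rows_for("organization")
print(len(direct), len(rows), len(set(rows)))
```

In this toy graph the direct query finds nothing, the path-style query returns three rows, and only two distinct items are behind them, which is the same kind of gap as between the counts quoted in this thread.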

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said, it is
important to think about how the data will be maintained in the future.

Antonin
Ettore RIZZA
2017-10-16 14:17:12 UTC
Permalink
While I'm on the subject, I would like to draw attention to the Neckar
project <http://event.ifi.uni-heidelberg.de/?page_id=532>, which aims
precisely to classify Wikidata entities into people, places and
organizations. Frequently updated JSON dumps are available.
Sebastian Hellmann
2017-10-16 15:53:22 UTC
Permalink
ah, ok, sorry, I was assuming that Blazegraph would transitively resolve
this automatically.

Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the
Handelsregister. (No new identifiers added, so we can start right now.)

Add a constraint that all German organisations should be connected to a
court, i.e. the registering organisation as well as the id assigned by
the court.

@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter orgs by
other criteria, e.g. country as a blocking key, and then string-match the
rest.
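
A rough sketch of what "blocking key, then string match" could look like in practice follows. This is only an illustration under my own assumptions: the function names, the 0.9 threshold, the use of the city as the block, and the placeholder item ids are all invented, and `difflib` is used simply because it ships with Python.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace; deliberately simple."""
    return " ".join(text.lower().split())


def match_by_block(
    registry: List[Tuple[str, str]],       # (name, block) from the register
    wikidata: List[Tuple[str, str, str]],  # (item id, label, block)
    threshold: float = 0.9,
) -> List[Tuple[str, str, float]]:
    """Compare names only inside the same block; keep the best match."""
    blocks: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for item_id, label, block in wikidata:
        blocks[normalise(block)].append((item_id, normalise(label)))

    results = []
    for name, block in registry:
        best = None
        for item_id, label in blocks.get(normalise(block), []):
            score = SequenceMatcher(None, normalise(name), label).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (item_id, score)
        if best is not None:
            results.append((name, best[0], best[1]))
    return results


# Placeholder ids, not real Wikidata items.
wikidata_items = [
    ("Q_EXAMPLE_1", "A&A Dienstleistungsgesellschaft mbH", "Leipzig"),
    ("Q_EXAMPLE_2", "A&A Dienstleistungsgesellschaft mbH", "Hamburg"),
]
registry_rows = [
    ("A&A Dienstleistungsgesellschaft mbH", "Leipzig"),
    ("Beispiel Handel GmbH", "Dresden"),
]
matches = match_by_block(registry_rows, wikidata_items)
print(matches)
```

Since every register entry is German, country alone would put everything in one block, so the example blocks on the city instead. The second Wikidata row shows why the block matters: an identical name in another city is never compared.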

# Task 2:

Add all missing identifiers for the remaining orgs in Handelsregister.
Task 2 can be rediscussed and decided once Task 1 is sufficiently complete.


# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is copied from
somewhere else eventually, but Wikipedia has the same problem. In the
case of the German Business register, maintenance is especially easy as
the orgs are stable and uniquely identifiable. Even the fact that a
company gets shut down should still be in Wikidata, so you have
historical information. I mean, you also keep the Roman Empire, the
Hanse and even finished projects in Wikidata. So even if an org ceases
to exist, the entry in Wikidata should stay.

# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be open,
but you actually cannot get the data. If somebody has a data dump,
please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It appears
to push open availability of data, but the openness is limited to the
licenses. Usefulness is limited as there are no free dumps and no
possibility to duplicate it effectively. Wikipedia and Wikidata provide
dumps and an API for exactly this reason. Every time somebody wants to
create an open organisation dataset with no barriers, the existence of
Opencorporates is blocking this.

Cheers,
Sebastian
And
 my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).
So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
Post by Antonin Delpeuch (lists)
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future

Antonin
Post by Sebastian Hellmann
https://query.wikidata.org/#SELECT
<https://query.wikidata.org/#SELECT> %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results. And
I imagine there are many other categories that bring together commercial
companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would
make Wikidata a kind of totally free clone of Open Corporates
<https://opencorporates.com/>. I would of course be delighted to see
that, but is it not a challenge to maintain such a database? Companies
are like humans, it appears and disappears every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e.
associations and companies. They are also unique.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
so there would be a maximum of 40k duplicates. These are easy to find
and deduplicate.
- The crawl can be done easily; a colleague has done so before.
- Do you want to upload the data in Wikidata? It would be a really big
extension. Can I go ahead?
- If the data were available externally as structured data under an
open license, I would probably not suggest loading it into Wikidata,
as the data could be retrieved from the official source directly.
However, here this data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e.
only facts, is OK for Wikidata. This is done in a lot of places, I
guess. The same goes for Wikipedia, i.e. news articles and copyrighted
books are referenced. So Wikimedia or the Wikimedia community are experts
on this.
All the best,
Sebastian
Hi Sebastian,

This is huge! It will cover almost all currently existing German
companies. Many of these will have similar names, so preparing for
disambiguation is a concern.

A good way for such an approach would be proposing a property for
an external identifier, loading the data into Mix-n-match,
creating links for companies already in Wikidata, and adding the
rest (or perhaps only parts of them - I’m not sure if having all
of them in Wikidata makes sense, but that’s another discussion),
preferably with location and/or sector of trade in the description
field.

I’ve tried to figure out what could be used as key for an external
identifier property. However, it looks like the registry does not
offer any (persistent) URL to its entries. So for looking up a
company, apparently there are two options:

- conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”
- copying the register number “32853”, selecting the court (Leipzig)
from the corresponding dropdown list, and searching for that
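
To illustrate the second lookup option, the combination of court, register type and register number can be treated as a composite key. The following is only a minimal sketch in Python: the `RegisterKey` type, the `parse_register_key` helper and the regular expression are hypothetical names introduced here, and the pattern assumes entries shaped like "Leipzig HRB 32853" with a purely numeric register number, which may not cover every real entry.

```python
import re
from typing import NamedTuple, Optional


class RegisterKey(NamedTuple):
    """Hypothetical composite key: registering court, register type, number."""
    court: str
    register_type: str
    number: str


# Assumption: the register type is one of these abbreviations and the
# number consists of digits only.
_KEY_PATTERN = re.compile(
    r"^\s*(?P<court>.+?)\s+(?P<type>HRA|HRB|GnR|PR|VR)\s+(?P<number>\d+)\s*$"
)


def parse_register_key(text: str) -> Optional[RegisterKey]:
    """Split a string such as 'Leipzig HRB 32853' into its three parts.

    Returns None when the text does not match the assumed layout.
    """
    match = _KEY_PATTERN.match(text)
    if match is None:
        return None
    return RegisterKey(match["court"], match["type"], match["number"])


example_key = parse_register_key("Leipzig HRB 32853")
```

A key of this kind is only unique if the court is part of it, which is why the sketch keeps all three parts together instead of storing the bare number.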

Both ways are not very intuitive, even if we can provide a link to
the search form. This would make a weak connection to the source
of information. Much more important, it makes disambiguation in
Mix-n-match difficult. This applies for the preparation of your
initial load (you would not want to create duplicates). But much
more so for everybody else who wants to match his or her data
later on. Being forced to search for entries manually in a
cumbersome way for disambiguation of a new, possibly large and
rich dataset is, in my eyes, not something we want to impose on
future contributors. And often, the free information they find in
the registry (formal name, register number, legal form, address)
will not easily match with the information they have (common name,
location, perhaps founding date, and most important sector of
trade), so disambiguation may still be difficult.

Have you checked which parts of the accessible information as
below can be crawled and added legally to external databases such
as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462
_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
Yaroslav Blanter
2017-10-16 16:06:28 UTC
Permalink
Dear All,

it is great that we are having this discussion, but may I suggest that we
have it on the RfP page on Wikidata? People have already asked similar
questions there, and, in my experience, on-wiki discussion will likely lead
to a refined request which will accommodate all suggestions.

Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann wrote:
Post by Sebastian Hellmann
Ah, OK, sorry, I was assuming that Blazegraph would transitively resolve
this automatically.
1. Connect all existing organisations with the data from the
Handelsregister. (No new identifiers are added, so we can start right now.)
- Add a constraint that all German organisations should be connected to a
court, i.e. the registering organisation as well as the ID assigned by the
court.
@all: are there any properties I can reuse for this?
- I will focus on this as it seems quite easy. We can first filter orgs by
other criteria, such as country as a blocking key, and then string-match
the rest.
2. Add all missing identifiers for the remaining orgs in the Handelsregister.
Step 2 can be rediscussed and decided once step 1 is sufficiently finished.
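
The idea of using the country as a blocking key and string-matching the rest could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the record layout (dicts with `id`, `name`, `country`), the `normalise` and `match_candidates` helpers, the 0.9 threshold, and the placeholder item IDs, which are not real Wikidata items. It uses `difflib.SequenceMatcher` from the standard library as a simple similarity measure; a real matching run would likely need something more robust.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case the name and normalise spacing around '&'."""
    return " ".join(name.lower().replace("&", " & ").split())


def match_candidates(registry, wikidata, threshold=0.9):
    """Return (registry_id, wikidata_id, score) for the best match per record.

    Records are only compared within the same country block, so the
    number of string comparisons stays small.
    """
    blocks = defaultdict(list)
    for item in wikidata:
        blocks[item["country"]].append(item)

    matches = []
    for record in registry:
        best, best_score = None, 0.0
        for item in blocks.get(record["country"], []):
            score = SequenceMatcher(
                None, normalise(record["name"]), normalise(item["name"])
            ).ratio()
            if score > best_score:
                best, best_score = item, score
        if best is not None and best_score >= threshold:
            matches.append((record["id"], best["id"], round(best_score, 2)))
    return matches


# Made-up sample data; the Q-style IDs are placeholders, not real items.
registry_sample = [
    {"id": "Leipzig HRB 32853",
     "name": "A&A Dienstleistungsgesellschaft mbH", "country": "DE"},
]
wikidata_sample = [
    {"id": "Q_EXAMPLE_1",
     "name": "A & A Dienstleistungsgesellschaft mbH", "country": "DE"},
    {"id": "Q_EXAMPLE_2",
     "name": "A&A Dienstleistungsgesellschaft mbH", "country": "AT"},
]
example_matches = match_candidates(registry_sample, wikidata_sample)
```

Because all register entries share the same country, a finer blocking key (for example the city or the postal code) would probably be needed in practice; the country is used here only because it is the key named in the message.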
I find Wikidata as such very hard to maintain, as all data is eventually
copied from somewhere else, but Wikipedia has the same problem. In the case
of the German business register, maintenance is especially easy as the orgs
are stable and uniquely identifiable. Even the fact that a company gets
shut down should still be in Wikidata, so you have historical information.
I mean, you also keep the Roman Empire, the Hanse and even finished
projects in Wikidata. So even if an org ceases to exist, the entry in
Wikidata should stay.
# regarding Opencorporates
I am critical of Opencorporates. It appears to be open, but
you cannot actually get the data. If somebody has a data dump, please
forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It appears
to push open availability of data, but then it is limited to open licenses.
Its usefulness is limited, as there are no free dumps and no possibility to
duplicate it effectively. Wikipedia and Wikidata provide dumps and an API
for exactly this reason. Every time somebody wants to create an open
organisation dataset with no barriers, the existence of Opencorporates is
blocking this.
Cheers,
Sebastian
And my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).
So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
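
The effect of a missing DISTINCT can be reproduced on a toy graph. The Python sketch below is not how Blazegraph evaluates queries; it only models one way an item can be returned more than once, namely when it has several "instance of" values that each lead to the target class. The class names, item names and the helpers `classes_reaching` and `instances` are all made up for this illustration.

```python
def classes_reaching(target, subclass_of):
    """Return the target plus every class with a subclass chain up to it."""
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    reached = {target}
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if child not in reached:
                reached.add(child)
                frontier.append(child)
    return reached


def instances(target, instance_of, subclass_of, distinct):
    """List items typed with the target class or one of its subclasses.

    Without `distinct`, an item yields one row per matching class.
    """
    accepted = classes_reaching(target, subclass_of)
    rows = [
        item
        for item, classes in instance_of.items()
        for cls in classes
        if cls in accepted
    ]
    return sorted(set(rows)) if distinct else rows


# Toy data: "item_a" is typed with two classes that both lead to the target.
SUBCLASS_OF = {"company": ["organization"], "employer": ["organization"]}
INSTANCE_OF = {"item_a": ["company", "employer"], "item_b": ["company"]}

raw_rows = instances("organization", INSTANCE_OF, SUBCLASS_OF, distinct=False)
distinct_rows = instances("organization", INSTANCE_OF, SUBCLASS_OF, distinct=True)
```

In this toy model the raw result has three rows for two items, which is the same kind of overcount that the corrected figure above removes.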
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
?item wdt:P31/wdt:P279* ?type
You can also use other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)".

So in general, it's very hard to have any "guarantees that there are no
duplicates", simply because you have no guarantee that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said, it is
important to think about how the data will be maintained in the future.

Antonin
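
As a rough sketch of how the property path above fits into a complete counting query, the Python helpers below build the query text and the corresponding links. The function names (`count_query`, `query_service_link`, `sparql_endpoint_url`) are hypothetical; the sketch assumes the `wd:` and `wdt:` prefixes are predefined by the Wikidata Query Service, and it only constructs strings without sending any request. A count over a class hierarchy this large may well run into the service's time limit.

```python
from urllib.parse import quote, urlencode


def count_query(type_qid, distinct=True):
    """Build a SPARQL query counting items of a type, including subclasses.

    `wdt:P31/wdt:P279*` follows "instance of" and then zero or more
    "subclass of" links; DISTINCT counts each item only once.
    """
    keyword = "DISTINCT " if distinct else ""
    return (
        f"SELECT (COUNT({keyword}?item) AS ?count) WHERE {{\n"
        f"  ?item wdt:P31/wdt:P279* wd:{type_qid} .\n"
        f"}}"
    )


def query_service_link(query):
    """Link that opens the query in the query.wikidata.org web interface."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


def sparql_endpoint_url(query):
    """URL for fetching the result as JSON from the SPARQL endpoint."""
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"}
    )


organization_count_query = count_query("Q43229")
```

Calling `count_query("Q43229", distinct=False)` gives the variant without DISTINCT, which makes it easy to compare the two counts side by side.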
Sebastian Hellmann
2017-10-16 16:11:27 UTC
Permalink
Hi Yaroslav,

in addition to this list, I added it here:

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister

and here:

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister

but I received more and longer answers on this list.

All the best,

Sebastian
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies
(KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
<http://www.w3.org/community/ld4lt>
Homepage: http://aksw.org/SebastianHellmann
<http://aksw.org/SebastianHellmann>
Research Group: http://aksw.org
_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
<https://lists.wikimedia.org/mailman/listinfo/wikidata>
_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
<http://www.w3.org/community/ld4lt>
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Sebastian Hellmann
2017-10-16 23:37:40 UTC
Permalink
Ok, I put some effort into
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister
to move the discussion there.

All the best,

Sebastian
Post by Federico Morando
Dear All,
it is great that we are having this discussion, but may I please
suggest having it on the RfP page on Wikidata? People have already asked
similar questions there, and, in my experience, on-wiki discussion
will likely lead to a refined request that accommodates all suggestions.
Cheers
Yaroslav
On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann
ah, ok, sorry, I was assuming that Blazegraph would transitively
resolve this automatically.
1. Connect all existing organisations with the data from the
Handelsregister. (No new identifiers are added, so we can start right now.)
Add a constraint that all German organisations should be connected
to a court, i.e. the registering organisation, as well as to the ID
assigned by the court.
@all: are there any properties I can reuse for this?
I will focus on this as it seems quite easy. We can first filter
orgs by other criteria, i.e. country as a blocking key, and then
string match the rest.
2. Add all missing identifiers for the remaining orgs in the
Handelsregister. Step 2 can be rediscussed and decided once step 1 is
sufficiently finished.
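
The matching idea above (first narrow the candidates with a cheap blocking key, then string-match the names) could be sketched roughly as follows. This is only an illustration: the record fields (`id`, `name`, `city`, `qid`), the use of the city as the blocking key, the placeholder QIDs and the 0.9 threshold are all assumptions, not part of the registry data or of Wikidata.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a company name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def match_records(registry, wikidata, threshold=0.9):
    """Pair registry records with Wikidata records that share a blocking key
    (here: the city) and have sufficiently similar names."""
    # Blocking: index Wikidata records by key so that each registry record
    # is compared only against candidates in the same block.
    blocks = defaultdict(list)
    for item in wikidata:
        blocks[item["city"].lower()].append(item)

    matches = []
    for rec in registry:
        best, best_score = None, 0.0
        for cand in blocks.get(rec["city"].lower(), []):
            score = SequenceMatcher(
                None, normalise(rec["name"]), normalise(cand["name"])
            ).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            matches.append((rec["id"], best["qid"], round(best_score, 2)))
    return matches


# Hypothetical sample data; the QIDs are placeholders, not real items.
registry = [
    {"id": "Leipzig HRB 32853",
     "name": "A&A Dienstleistungsgesellschaft mbH", "city": "Leipzig"},
]
wikidata = [
    {"qid": "Q_EXAMPLE_1",
     "name": "A&A  Dienstleistungsgesellschaft mbH", "city": "Leipzig"},
    {"qid": "Q_EXAMPLE_2",
     "name": "A&A Dienstleistungsgesellschaft mbH", "city": "Hamburg"},
]
print(match_records(registry, wikidata))
```

Blocking keeps the number of pairwise comparisons manageable at this scale, but anything below a perfect score would still need human review, for example in Mix-n-match.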
I find Wikidata as such very hard to maintain as all data is
copied from somewhere else eventually, but Wikipedia has the same
problem. In the case of the German Business register, maintenance
is especially easy as the orgs are stable and uniquely
identifiable. Even the fact that a company gets shut down should
still be in Wikidata, so you have historical information. I mean,
you also keep the Roman Empire, the Hanse and even finished
projects in Wikidata. So even if an org ceases to exist, the entry
in Wikidata should stay.
# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be
open, but you actually cannot get the data. If somebody has a
data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It
appears to push open availability of data, but then it is limited
to open licenses. Usefulness is limited as there are no free dumps
and no possibility to duplicate it effectively. Wikipedia and
Wikidata provide dumps and an API for exactly this reason.
Every time somebody wants to create an open organisation dataset
with no barriers, the existence of Opencorporates is blocking this.
Cheers,
Sebastian
And my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).
So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
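
A toy illustration of how a count gets inflated without DISTINCT: an item is returned once per route by which it reaches "organization", so an item with more than one route is counted more than once. The tiny graph and its identifiers are made up, and the sketch models duplicate result rows in general rather than the exact path semantics of the query service.

```python
# Toy graph (made-up identifiers): instance-of (P31) and subclass-of (P279) edges.
P31 = {
    "item_a": ["business", "employer"],  # two classes, both organizations
    "item_b": ["business"],
    "item_c": ["city"],                  # not an organization
}
P279 = {
    "business": ["organization"],
    "employer": ["organization"],
    "city": ["place"],
}


def reaches(cls, target):
    """True if `target` is reachable from `cls` via zero or more P279 edges."""
    seen, stack = set(), [cls]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(P279.get(node, []))
    return False


# One result row per (item, class) pair whose class leads to the target,
# which is what a query without DISTINCT effectively counts.
rows = [item for item, classes in P31.items()
        for cls in classes if reaches(cls, "organization")]

print(len(rows))       # number of rows without DISTINCT
print(len(set(rows)))  # number of rows with DISTINCT
```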
Antonin
Post by Antonin Delpeuch (lists)
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"
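
For reference, the pattern above can be turned into a counting query and a query-service link with a few lines of Python. This is a sketch only: the function names are made up, the query is not claimed to be the exact one behind the numbers in this thread, and a count over this many items may well time out on the public endpoint.

```python
from urllib.parse import quote, unquote


def organisation_count_query(root_qid="Q43229"):
    """Build a SPARQL query that counts distinct items whose type is the
    root class or any of its subclasses."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{root_qid} .\n"
        "}"
    )


def query_service_link(sparql):
    """Percent-encode the query into a query.wikidata.org link, in the same
    style as the links posted in this thread."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


query = organisation_count_query()
link = query_service_link(query)
print(query)
print(link)
```

Counting `DISTINCT ?item` instead of rows avoids counting an item once per matching class.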

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future

Antonin
Post by Sebastian Hellmann
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results. And
I imagine there are many other categories that bring together commercial
companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would
make Wikidata a kind of totally free clone of Open Corporates
<https://opencorporates.com/>. I would of course be delighted to see
that, but is it not a challenge to maintain such a database? Companies
are like humans: they appear and disappear every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e.
associations and companies. They are also unique.
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
so there would be a maximum of 40k duplicates. These are easy to find
and deduplicate.
- The crawl can be done easily; a colleague has done so before.
- Do you want this data uploaded to Wikidata? It would be a really big
extension. Can I go ahead?
- If the data were available externally as structured data under an
open license, I would probably not suggest loading it into Wikidata,
as the data could be retrieved from the official source directly.
However, here the data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e.
only facts, is OK for Wikidata. This is done in a lot of places, I
guess. The same goes for Wikipedia, where news articles and copyrighted
books are referenced. So Wikimedia and the Wikimedia community are experts
on this.
All the best,
Sebastian
Hi Sebastian,

This is huge! It will cover almost all currently existing German
companies. Many of these will have similar names, so preparing for
disambiguation is a concern.

A good way for such an approach would be proposing a property for
an external identifier, loading the data into Mix-n-match,
creating links for companies already in Wikidata, and adding the
rest (or perhaps only parts of them - I’m not sure if having all
of them in Wikidata makes sense, but that’s another discussion),
preferably with location and/or sector of trade in the description
field.

I’ve tried to figure out what could be used as key for an external
identifier property. However, it looks like the registry does not
offer any (persistent) URL to its entries. So for looking up a
company, apparently there are two options:

- conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”
- copying the register number “32853” plus selecting the
court (Leipzig) from the corresponding dropdown list and searching for that
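
Since the registry offers no persistent URL, any identifier would have to be composed from the court and the register number. A minimal sketch of extracting such a composite key from the heading of the example entry ("Leipzig HRB 32853") is given below. The `RegisterKey` type and the regular expression are assumptions made for illustration; real entries may not all follow this shape.

```python
import re
from typing import NamedTuple, Optional


class RegisterKey(NamedTuple):
    court: str          # registering court, e.g. "Leipzig"
    register_type: str  # register section, e.g. "HRB"
    number: str         # number assigned by the court


# Assumed shape: "<court> <register type> <number>".
KEY_PATTERN = re.compile(
    r"^\s*(?P<court>.+?)\s+(?P<type>[A-Za-z]+)\s+(?P<number>\d+)\s*$"
)


def parse_register_key(text: str) -> Optional[RegisterKey]:
    """Return the composite key, or None if the text does not fit the pattern."""
    match = KEY_PATTERN.match(text)
    if match is None:
        return None
    return RegisterKey(match["court"], match["type"], match["number"])


key = parse_register_key("Leipzig HRB 32853")
print(key)
print(" ".join(key))
```

The number alone is not enough as a key, because the lookup described above also requires selecting the court.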

Both ways are not very intuitive, even if we can provide a link to
the search form. This would make a weak connection to the source
of information. Much more important, it makes disambiguation in
Mix-n-match difficult. This applies for the preparation of your
initial load (you would not want to create duplicates). But much
more so for everybody else who wants to match his or her data
later on. Being forced to search for entries manually in a
cumbersome way for disambiguation of a new, possibly large and
rich dataset is, in my eyes, not something we want to impose on
future contributors. And often, the free information they find in
the registry (formal name, register number, legal form, address)
will not easily match with the information they have (common name,
location, perhaps founding date, and most important sector of
trade), so disambiguation may still be difficult.

Have you checked which parts of the accessible information as
below can be crawled and added legally to external databases such
as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462
Luigi Assom
2017-10-19 12:07:33 UTC
Permalink
Hi,

I would like to join a thread I found in the archive:
https://lists.wikimedia.org/pipermail/wikidata//2017-October/011259.html

I worked in contextual research to facilitate knowledge transfer.

One of the domains I would like to address is the visualisation of
economic networks.

I am aiming for an impact on the governance of innovation and on
transparency about the control of economic networks, and I would also like
to allow SMEs and private citizens to build their own analytics and
prevent cases of collusion.

Information about business profiles is currently a premium service provided
by private specialised corporations. Much of the information about
companies is public, but there is a lack of open data policy.

I would like to fill the gap and contribute to feeding Wikidata as a
repository, either in bulk or as a collective action. As a design thinker I
could contribute by designing processes for filling in data, such as
applications that facilitate the process.

*Is there any guidance or clearance about these initiatives?*

I am happy to read of similar interest from Germany, Belgium and Italy; I
would like to connect.

I read that feeding Wikidata with corporate information would significantly
increase its size. Still, I think the benefit of allowing inquiries for
public governance is that it would distribute the governance of economic
data.

Aside from public services like:
https://www.gov.uk/government/organisations/companies-house

I would like to allow data-visualisation researchers (like myself) to
uncover for the public results like:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0025995

which rely on private partnerships to access corporate databases, so the
findings cannot be queried by the public.

*Is there a specific Wikidata policy to comply with when feeding in data
from website scrapers?*

As a start, the URIs of sites with a good reputation could act as
*identifiers*.
I believe that scraping would be legitimate, since the information about the
property "facts" (below) is public, and the organisations that collated the
data provide services (such as professional communities or services augmented
with private data) that would not be in competition with building a
repository.

In a way, I see Wikidata as a possibility for indexing data that can be
useful to search engines and discovery engines, and indexing data is an
activity that such services run daily. I believe that enabling public
transparency would enhance open-data services.


Below some properties of interest.




Properties I would be interested in are:
- TEAM (founders)
- DESCRIPTION (corporate description over products and services)
- INVESTORS (corp. and private equity)
- EMPLOYEES / INCUBATORS / ADVISORS (personal information available as
public information over the web)
- PARTICIPATED COMPANIES
- DATE of acquisition or participation to companies
- CAPITAL (if available, or in ranges)
- VAT NUMBER (or registry number)
- ADDRESS
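
One way to make the list above concrete before mapping it to Wikidata properties is a plain record type. This is only a sketch: the class and field names are an informal reading of the list and are not existing Wikidata property names; the sample values are taken from the registry entry quoted earlier in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participation:
    company: str                # name of the participated company
    date: Optional[str] = None  # date of acquisition or participation, if known


@dataclass
class CompanyProfile:
    name: str
    registry_number: Optional[str] = None  # VAT or registry number
    address: Optional[str] = None
    description: Optional[str] = None      # products and services
    capital: Optional[str] = None          # exact amount or a range
    founders: List[str] = field(default_factory=list)
    investors: List[str] = field(default_factory=list)
    people: List[str] = field(default_factory=list)  # employees, incubators, advisors
    participations: List[Participation] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty, e.g. to drive crowd-sourced completion."""
        return [name for name, value in vars(self).items() if value in (None, [])]


profile = CompanyProfile(
    name="A&A Dienstleistungsgesellschaft mbH",
    registry_number="Leipzig HRB 32853",
    capital="25.000,00 EUR",
    address="Prager Straße 38-40, 04317 Leipzig",
)
print(profile.missing_fields())
```

Listing the empty fields shows at a glance what the free part of the registry does not provide (founders, investors, participations) and what would have to come from other sources.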

Any other ideas on how to fetch the business profiles of companies?
They should be publicly available somehow, since each corporation reports to
the organisation registry, and there are already private companies offering
analytics on business profiles.



Luigi
Thad Guidry
2017-10-19 12:17:18 UTC
Permalink
Hi Luigi,

Have you looked at https://opencorporates.com ?

Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>
Luigi Assom
2017-10-19 16:07:59 UTC
Permalink
Hi Thad,
Post by Sebastian Hellmann
# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be
open, but you actually cannot get the data. If somebody has a
data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It
appears to push open availability of data, but then it is limited
to open licenses. Usefulness is limited as there are no free dumps
and no possibility to duplicate it effectively. Wikipedia and
Wikidata provide dumps and an API for exactly this reason.
Every time somebody wants to create an open organisation dataset
with no barriers, the existence of Opencorporates is blocking this.
I think that being able to analyse the data in bulk is
important.

Some data in OpenCorporates is incomplete (founders, capital raised,
investors), even though some information is contributed by users.
Currently most data is about the US and NZ; I'd like to see the EU better
represented.

I would like to be able to visualise a network of companies and
their participations,
and build bipartite graphs between persons and companies.
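
A bipartite person/company graph of this kind needs nothing more than two adjacency maps. The sketch below uses made-up person and company names purely for illustration, and projects the graph onto companies, so that two companies are linked when they share a person.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (person, company) pairs, purely for illustration.
roles = [
    ("person_1", "company_a"),
    ("person_1", "company_b"),
    ("person_2", "company_b"),
    ("person_2", "company_c"),
]

# Bipartite graph stored as two adjacency maps.
companies_of = defaultdict(set)
persons_of = defaultdict(set)
for person, company in roles:
    companies_of[person].add(company)
    persons_of[company].add(person)

# Projection onto companies: two companies are linked if they share a person.
shared = defaultdict(set)
for person, companies in companies_of.items():
    for left, right in combinations(sorted(companies), 2):
        shared[(left, right)].add(person)

for pair, persons in sorted(shared.items()):
    print(pair, sorted(persons))
```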
I will try to reach them about cooperating on such a project.

Do you have connections with them?
Thad Guidry
2017-10-19 17:06:22 UTC
Permalink
No connections to Opencorporates, sorry.

The good news is that the data sources in Opencorporates (the Registers)
are accessible to you...sometimes in dump format.

https://opencorporates.com/registers

Hope that helps you further in your research and needs. I am not saying
it's easy :)

-Thad
Jakob Voß
2017-10-25 07:44:39 UTC
Permalink
Hi Luigi,

I favour cooperation with OpenCorporates instead of independently adding
lots of company records to Wikidata. Sure, there are parallel strategies,
but any effort should also include OpenCorporates to some degree.

OpenCorporates is licensed under ODbL (just added this referenced
statement to Q7095760) and we have property P1320 to link Wikidata and
OpenCorporates. A first step would be to align

https://opencorporates.com/registers
https://en.wikipedia.org/wiki/List_of_company_registers

Right now we have 18 instances of company register (Q1394657) and its
subclasses explicitly classified as such in Wikidata.

These items should be linked with the registers listed at
OpenCorporates, e.g.

UK Companies House (Q257303)
= https://opencorporates.com/registers/270

I've also noticed that OpenCorporates has a field for "Identifiers"
where Wikidata QIDs may be included to have two-way-links between the
two datasets.
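
The alignment of register items could start as a simple hand-maintained table, sketched below. Only the Companies House pair comes from this message; the helper function and the placeholder identifier are hypothetical.

```python
# Wikidata item of a company register -> register page at OpenCorporates.
# Only the first pair is taken from this message; extend the table by hand.
REGISTER_ALIGNMENT = {
    "Q257303": "https://opencorporates.com/registers/270",  # UK Companies House
}


def unaligned(register_qids, alignment=REGISTER_ALIGNMENT):
    """Return the register items that have no OpenCorporates link yet."""
    return [qid for qid in register_qids if qid not in alignment]


# "Q_NOT_YET_ALIGNED" is a placeholder, not a real item.
todo = unaligned(["Q257303", "Q_NOT_YET_ALIGNED"])
print(todo)
```

Running this over all instances of company register (Q1394657) would give a to-do list for the alignment.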

Anyway, better contact https://opencorporates.com/info/contributing at
least to let them know about your plans.

Cheers,
Jakob
--
Jakob Voß <***@gbv.de>
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de/
Laura Morales
2017-10-25 08:17:56 UTC
Permalink
Is there any RDF dump available of OpenCorporates data? Or even any dump at all? Their licensing terms are ambiguous... They say it's released under ODbL, but if I want to use the data I have to ask permission and they will decide if I can use it for free or if I have to pay a fee :/
 
 

Thad Guidry
2017-10-25 15:06:58 UTC
Permalink
Laura,

Talk to OpenCorporates and ask those questions yourself.
Get involved ! :)

-Thad
+ThadGuidry <https://plus.google.com/+ThadGuidry>
Post by Laura Morales
Is there any RDF dump available of OpenCorporates data? Or even any dump
at all? Their licensing terms are ambiguous... They say it's released under
ODbL, but if I want to use the data I have to ask permission and they will
decide if I can use it for free or if I have to pay a fee :/
Laura Morales
2017-10-26 12:40:04 UTC
Permalink
OK, just asked. Their reply was that they "reserves the right under paragraph 3.3 of ODbL to release the database under different terms", which is to say their data is NOT free because they want to control how and where the data is used.
Are we starting to see "free vs open" all over again, this time with data instead of software?
 
 

Luigi Assom
2017-10-26 13:21:03 UTC
Permalink
I think Laura raised a very good point here.
One broader question:

is the Wikidata team thinking about moving their dataset onto a *blockchain*?
That, I do believe, would incentivise people to participate, maintain, and
even craft useful things with clear licenses (eventually profiting based on
the utility of a thing).


Again, the possibility to compute or process things that are of public
utility or governance depends on the accessibility of data,
or on the very large computational, legal and negotiation power of a few
stakeholders.

Making data accessible to multiple stakeholders from public audiences,
or at least mirroring the metadata, would allow sane "competition" and
collaboration in governance, with concrete applications that save $$$,
like preventing collusion between companies.

I also tried to connect with OpenCorporates (research and CEO).
No answer so far.
Post by Laura Morales
OK, just asked. Their reply was that they "reserves the right under
paragraph 3.3 of ODbL to release the database under different terms", which
is to say their data is NOT free because they want to control how and where
the data is used.
Are we starting to see "free vs open" all over again, this time with data
instead of software?
Jakob Voß
2017-10-27 07:14:47 UTC
Permalink
Post by Laura Morales
OK, just asked. Their reply was that they "reserves the right under
paragraph 3.3 of ODbL to release the database under different terms",
which is to say their data is NOT free because they want to control
how and where the data is used. Are we starting to see "free vs open"
all over again, this time with data instead of software?
This means we could re-publish the data openly once we actually get it
but they make it hard to get their data :-(

I'd still try to be open about OpenCorporates and keep on asking them.
If they don't switch to more open data sharing, they will likely be
replaced, that's for sure. So work independently from OpenCorporates but
stay compatible unless they actively refuse to work with Wikidata in any
way.

Cheers,
Jakob
--
Jakob Voß <***@gbv.de>
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de/
Luigi Assom
2017-10-27 09:38:03 UTC
Permalink
As a general question, *is it discouraged or encouraged to mirror corporate
data on Wikidata as a public repository?*

*Could you provide a bullet list of reasons why it would be discouraged?*

*How does the decision process work?*
https://www.wikidata.org/wiki/Wikidata:Introduction
Data is entered and maintained *by Wikidata editors*, who decide on the
rules of content creation and management.
*A secondary database.* Wikidata records not just statements, but also
their sources, and connections to other databases.
*According to Wikidata editors, is it possible to index web sources and
collate their data on WD?* How do they deal with bulk data or pieces of data
that may be provided by users?

Indexing the web does not require agreements; any web crawler of a
search engine works like that.

Here, a crowd of people can coordinate themselves to create a *consistent*
database.

I believe consistency is key to serving *Anyone in the world.*
Anyone can use Wikidata for any number of different ways by using its
application programming interface.
I think applications that have "value" in the sense of corporate datasets
can be built on data that includes business profiles and ownership of
participated companies and subsidiaries, as well as the stakeholders who
participate in the business.

Imagine a minimised version of Bloomberg or Bureau van Dijk, free to serve
*anyone in the world.*

*****

I think I could contribute in three ways:

- collecting data
- designing a test application to facilitate crowd-sourced addition of
data
- providing a simplified guide to handling Wikidata properties for a
specific case (a kind of infographic, but it needs very clear guidance on the
entities and properties for corporates).