"Poorly structured" HTML is not all that bad in 2018 thanks to HTML 5,
which builds the "rendering decisions made about broken HTML from
Netscape 3" into the standard, so that an HTML 5 parser in most common
languages gives you the same DOM tree as the browser.
If you try to use an official or unofficial API to fetch data from some
service in 2018, you will have to add some dependencies, and you just
might open a can of whoop-ass that will make you reinstall Anaconda, or
maybe you will learn something you'll never be able to unlearn about how
XML processing changed between two minor versions of the JDK.
On the other hand, I have often dusted off the old HTML-based parser I
made for Flickr and found I could get it to work for other media
collections, blogs, etc. by just changing the "semantic model" embodied
in the application, which could be as simple as some function or object
that knows something about the structure of the URLs of some documents.
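A minimal sketch of that split, using only the Python standard library. The generic part collects links; the "semantic model" is a plain function that recognizes URL shapes. The URL layout, the name `photo_site_model`, and the example.org host are all invented for illustration; this is not Flickr's real URL structure or the parser mentioned above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Generic part: collect absolute URLs from every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical URL layout for a made-up photo site: /photos/<user>/<id>/
PHOTO_PATH = re.compile(r"^/photos/(?P<user>[^/]+)/(?P<photo_id>\d+)/?$")


def photo_site_model(url):
    """Site-specific part: say what a URL means, or None if unknown."""
    match = PHOTO_PATH.match(urlparse(url).path)
    if match:
        return {"type": "photo", **match.groupdict()}
    return None


def extract(html, base_url, model):
    """Run the generic collector, then let the model interpret each link."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    results = []
    for link in collector.links:
        record = model(link)
        if record is not None:
            results.append((link, record))
    return results
```

Pointing the same `extract` at a different site would then mean writing another small function in place of `photo_site_model`, without touching the collector.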
I cannot understand why so many standards for integrating RDF and HTML
have been pushed and gone nowhere, yet nobody has promoted the clean
solution of "add a CSS media type for RDF" that marks up the semantics
of HTML the way JSON-LD works.
Even without that, though, much of the time these days matching
patterns against the CSS gets you most of the way there.
I've had cases where I haven't had to change the rule sets much at all,
and every one of them has been well under 50 lines of code.
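To give a feel for how small such a rule set can be, here is a hedged sketch in standard-library Python. It supports only `tag.class` patterns, which is a tiny subset of CSS selectors; a real project would more likely use a proper selector library. The field names and class names in `RULES` are hypothetical and not taken from any real site.

```python
from html.parser import HTMLParser

# Hypothetical rule set: field name -> "tag.class" pattern.
RULES = {
    "name": "h1.person-name",
    "birth_date": "span.birth-date",
}


class RuleMatcher(HTMLParser):
    """Capture the text of the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = {f: tuple(p.split(".", 1)) for f, p in rules.items()}
        self.fields = {}
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag name of the element being captured
        self._depth = 0     # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.rules.items():
            if field in self.fields:
                continue  # keep the first match only
            if tag == want_tag and want_class in classes:
                self._field, self._tag, self._depth = field, tag, 1
                self.fields[field] = ""
                return

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.fields[self._field] = self.fields[self._field].strip()
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.fields[self._field] += data


def scrape(html, rules):
    """Apply a rule set to an HTML string and return the captured fields."""
    matcher = RuleMatcher(rules)
    matcher.feed(html)
    matcher.close()
    return matcher.fields
```

Adapting this to another site would mostly mean editing `RULES`, which is the sense in which the rule set stays short.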
------ Original Message ------
From: "Federico Leva (Nemo)" <***@gmail.com>
To: "Discussion list for the Wikidata project"
<***@lists.wikimedia.org>; "Ettore RIZZA" <***@gmail.com>
Sent: 9/26/2018 1:00:53 PM
Subject: Re: [Wikidata] Looking for "data quality check" bots
Post by Federico Leva (Nemo)
Post by Ettore RIZZA
I'm looking for Wikidata bots that perform accuracy audits. For
example, comparing the birth dates of persons with the same date
indicated in databases linked to the item by an external-id.
This is mostly a screenscraping job, because most external databases
are only accessible in unstructured or poorly structured HTML form.