Discussion:
[Wikidata] Raising alerting threshold for Wikidata Query Service updater lag
Guillaume Lederrey
2018-11-02 13:33:46 UTC
Hello all!

TL;DR: the alerting thresholds on Wikidata Query Service have been raised;
any Icinga alert should now be treated seriously.

As you might know already, we're having trouble keeping up with updates
on the public Wikidata Query Service cluster. We're working on it, but
it is a hard problem. At the same time, known use cases of the public
WDQS endpoint don't depend on a short update lag.

As such, we have increased the alerting threshold on update lag for
this public cluster to 6h / 12h for WARNING / CRITICAL [1]. This does
not actually change the quality of service of WDQS public endpoints,
but somewhat aligns expectations and reality. It also means that all
alerts raised by WDQS should now be treated seriously and not ignored
as known issues with no immediate solution.
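
As a rough illustration of what the new thresholds mean, here is a
minimal sketch in Python. It is not the actual Icinga check or the
Puppet change in [1]; the function name `classify_lag` and the choice
of an inclusive comparison at exactly 6h / 12h are assumptions made
for the example.

```python
from datetime import timedelta

# Thresholds taken from the announcement: 6h WARNING, 12h CRITICAL.
WARNING_LAG = timedelta(hours=6)
CRITICAL_LAG = timedelta(hours=12)


def classify_lag(lag: timedelta) -> str:
    """Map an updater lag to an alert level (hypothetical helper).

    Assumption: a lag exactly at a threshold already triggers that
    level; the real check may use a strict comparison instead.
    """
    if lag >= CRITICAL_LAG:
        return "CRITICAL"
    if lag >= WARNING_LAG:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for hours in (1, 7, 13):
        print(f"{hours}h lag: {classify_lag(timedelta(hours=hours))}")
```

With these values, a lag of a few hours no longer raises any alert,
which is why anything that does fire should be looked at.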

At the same time, we're having a conversation about what the service
level of that cluster should be [2]. Feel free to join that
conversation if you are impacted (or just if you have interesting
thoughts on the subject).

Thanks for your patience,

Guillaume


[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/470819
[2] https://phabricator.wikimedia.org/T199228

--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+1 / CET
Gerard Meijssen
2018-11-03 16:58:23 UTC
Hoi,
We have found that duplicate items are being created for publications, and
so far the only cause identified is the long lag before new data becomes
available. The notion that there is no practical impact is
therefore wrong.

If WDQS results are only considered for use outside Wikidata, fine.
However, since WDQS is also used in developing new and improved data,
changing the threshold basically signals that you accept the status quo,
and with it failure.
Thanks,
Gerard
_______________________________________________
Wikidata mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata
Guillaume Lederrey
2018-11-05 18:24:34 UTC
Thanks for the clarification! There is more discussion on the subject in
the related Phabricator task [1]. We don't have a good solution yet, but
your input on that task would be appreciated, if only to make your use
cases visible.

Have fun!

Guillaume

[1] https://phabricator.wikimedia.org/T199228
--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST