Posted by jive
on October 7, 2005 at 8:51 PM PDT
Sonny Parafina recently derided the state of extent information in the free data available on the web. Let's see what can be done despite the data providers.
One of the great turning points of the web (yes, I remember) was when we stopped trusting data providers. Seriously, keywords were only filled out by those seeking to mislead you, or by a few keeners.
Spatial data today is in the same boat - 90% of the searches you do for anything turn up the "Important Bird Areas". Why is this? Because the provider of that data got annoyed - so annoyed he actually filled in all his metadata (and keywords) - and now shows up all the time. Search for water? You get information about water birds. Search for Canada? You get information about our wonderful owl population.
This is silly, this is the web of 1993. Only now we have maps.
As talked about in a previous blog, the big three search engines are just cottoning on to the power of location. It will be a while before they start to help us with this problem.
I should mention the nice email that started this thinking (from Sonny Parafina):
For those folks who are setting up public OGC WMS/WFS/WCS services, I have a request:
PLEASE STOP SETTING YOUR BOUNDING BOX TO THE ENTIRE WORLD!
(unless you actually are providing coverage for the world)
I agree - let's see what can be done ... but first, some background information.
How Is Spatial Data Published?
Spatial clients often work directly with a single URL (called a Capabilities URL); the result is a GetCapabilities document describing the service and the data provided. It is this document that is often inaccurate - or, more specifically, not usefully accurate.
Clients use this Capabilities document as a "table of contents": in addition to describing the available data, it also tells you what URLs can be used to obtain that data. As an example, a Web Map Server will have an entry for its GetMap request.
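To make that concrete, here is a minimal sketch of how a client reads the "table of contents": pull layer names and advertised extents out of a GetCapabilities response. The XML below is a trimmed-down, hypothetical WMS 1.1.1 style fragment - real documents are much larger and use namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed capabilities document for illustration.
CAPS = """
<WMT_MS_Capabilities>
  <Capability>
    <Layer>
      <Name>birds</Name>
      <LatLonBoundingBox minx="-180" miny="-90" maxx="180" maxy="90"/>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>
"""

def layer_extents(caps_xml):
    """Return {layer name: (minx, miny, maxx, maxy)} from a capabilities doc."""
    root = ET.fromstring(caps_xml)
    extents = {}
    for layer in root.iter("Layer"):
        name = layer.findtext("Name")
        bbox = layer.find("LatLonBoundingBox")
        if name is not None and bbox is not None:
            extents[name] = tuple(
                float(bbox.get(k)) for k in ("minx", "miny", "maxx", "maxy")
            )
    return extents

print(layer_extents(CAPS))  # {'birds': (-180.0, -90.0, 180.0, 90.0)}
```

Notice the problem Sonny is complaining about right there in the sample: the advertised extent is the whole world.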
Can We Trust Data Providers?
Um - apparently not. Seriously, data providers are kind of done at the point where they have collected the information. They know what it was good for; we are kind of lucky they decided to publish it at all.
One thing we can do is go back to the Yahoo of old, filled with hapless people looking over the pages and treating everything like a library.
Yes - a manual process may tide us over until something smarter comes along. It is, after all, how this stuff is cataloged right now ...
Back to Sonny's email:
The reason is that it makes spatial queries in a Catalog absolutely useless. For example, I loaded Refractions Research's list of OGC services into our catalog; a query for roads with a world bounding box returned ~220 results, and the same query with a bounding box around Boston yielded ~210 results.
That is right - he is hitting a Refractions service for this information, and that list is constructed by a series of targeted searches of good old Google.
Yahoo for Spatial Information
And this is the idea - construct a "metadata look-aside service", one that all of our client apps can consult for a second opinion. Basically, it would serve up user-supplied corrections for specific WMS/WFS/WCS Layer information.
There are some technical difficulties: these standards-based web services are not the best at marking their generated content with the appropriate HTTP modification headers, and the standards themselves only recently started including version information (so you can tell when the document contents have changed).
Still, we need something to trust. If not the capabilities document itself ... perhaps we should just record the original bounding box along with the correction. As long as the service is still showing us the original value, we know that they have not fixed the problem ... and our correction is still "good". Actually, let's not say good; let's say "better".
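That record-the-original trick is simple enough to sketch. Everything below is made up for illustration: a correction is a pair (bbox the provider published, bbox a user says is right), and the correction only holds while the provider is still serving the old value.

```python
def lookup_correction(corrections, layer, published_bbox):
    """Return the corrected bbox, or the published one if no valid correction."""
    record = corrections.get(layer)
    if record is None:
        return published_bbox  # nothing known about this layer
    original, corrected = record
    if published_bbox != original:
        return published_bbox  # provider changed something: our correction is stale
    return corrected           # provider still serving the old value: ours is "better"

# Hypothetical look-aside data: the provider claims the whole world,
# a user supplied a bbox roughly around Canada.
corrections = {
    "birds": ((-180, -90, 180, 90), (-141, 42, -52, 83)),
}

# Provider still publishes the bogus world extent -> use the correction.
print(lookup_correction(corrections, "birds", (-180, -90, 180, 90)))
# Provider publishes something new -> the stale correction is ignored.
print(lookup_correction(corrections, "birds", (-145, 40, -50, 85)))
```

The nice property is that a correction can never outlive a provider who actually fixes (or even just changes) their metadata.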
Social hacking time - why would people do this? At least the users of the data have an interest in fixing the problem (by definition) and would probably like a way to "fix" the data (by the convention of using this look-aside service).
How about the server providers? If we record the number of hits to this look-aside service, we could at least give data providers a figure for how many users they are annoying, along with an email describing the fix ... that is not bad.
The alternative is giving users a "you suck" button set up to send email based on the contact information. Oh wait - we can't trust that either ...
This idea does have a drawback: some WMS server providers (using software by a nameless commercial vendor) have an annoying habit of "cascading" layers. This means that the same (possibly wrong) information will trickle around the web.
This look-aside service would annoy those with cascading WMS servers, as they would get bothered about extent information they have no direct control over. But then again, they annoy me (often by removing their cascaded layers the moment there is a problem, and then pretending they never heard of the information they knew a few moments before ...).
Still there is something here ... fight fire with fire.
Remember how spatial data is published? Those are XML documents, and they can be hacked by the daring.
Since these are well-understood formats, we could make our look-aside service take the URL of the capabilities document. If the document is known, the service can patch it as it is returned; if it is unknown, it can pass it through unmodified ...
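The patching step itself is not much code. Here is a minimal sketch (made-up layer names and a trimmed capabilities fragment, as before): rewrite the LatLonBoundingBox for layers we have a correction for, and leave everything else alone so unknown documents pass through unchanged.

```python
import xml.etree.ElementTree as ET

def patch_capabilities(caps_xml, corrections):
    """Rewrite LatLonBoundingBox attributes for layers with a known correction."""
    root = ET.fromstring(caps_xml)
    for layer in root.iter("Layer"):
        name = layer.findtext("Name")
        bbox = layer.find("LatLonBoundingBox")
        if name in corrections and bbox is not None:
            minx, miny, maxx, maxy = corrections[name]
            bbox.set("minx", str(minx))
            bbox.set("miny", str(miny))
            bbox.set("maxx", str(maxx))
            bbox.set("maxy", str(maxy))
    return ET.tostring(root, encoding="unicode")

# Hypothetical capabilities fragment with the usual bogus world extent.
SAMPLE = (
    '<WMT_MS_Capabilities><Capability><Layer>'
    '<Name>birds</Name>'
    '<LatLonBoundingBox minx="-180" miny="-90" maxx="180" maxy="90"/>'
    '</Layer></Capability></WMT_MS_Capabilities>'
)

patched = patch_capabilities(SAMPLE, {"birds": (-141, 42, -52, 83)})
print('minx="-141"' in patched)  # True - the client never sees the bogus extent
```

A real proxy would fetch the document from the provider's URL first, but the pass-through behaviour is the important bit: an unknown document comes back byte-for-byte equivalent.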
So how about some social motivation? This does sound like a great way to lazily generate a catalog of layers ... so a catalog provider should have motivation. How about data users - why should they supply the corrections? Well, they do use the data, and this information adds value, making the data more useful to them - that is a good trade.
Any more technical issues? A few clients would not enjoy the experience. Very early clients expect to hack at a single URL, supplying different request parameters; that is, they don't expect a separation between the entry points for the capabilities document, GetMap requests, and so on ... I am pretty sure we have disappointed that assumption often enough on the wibbly wobbly web by now that these issues have been fixed.
Sounds like we have an idea. Any takers?