It's nice to see a big idea like Web Services settle down and become just how things are done.
It wasn't too long ago that it was out there pounding the pavement just to get noticed. But like all good XML-based ideas, it didn't take long for it to be taken for granted.
With the current generation of services, where RSS and AJAX are replacing SOAP and REST, things are getting easier and easier.
And with XQuery and MarkLogic Server in the picture . . . well, it might be time for Web Services to put on its "services are just part of the web" slippers and pick up the background-technology newsletter.
At its core the idea is simple: you make something that outputs XML, I make a request with some parameters . . . and we are sharing data and content.
In the beginning, it got really complex because just doing that simple task was fairly hard. You had to build a data layer, some application logic, a web layer, an output layer and an input layer.
Best to standardize what we can (hence SOAP and REST) so we can invest all this effort once.
Fair enough - but in the back rooms of the tech labs, everyone was writing little, no-protocol web services that . . . well they output XML from a request.
This was one step beyond old fashioned web scraping . . . and like that old idea it worked and kept on working. If you look closely, the simple web services are how most things are done these days.
But can't it be simpler than using the XML add-ons to Java?
And what about all the content and data NOT exposed as nice XML?
Enter MarkLogic XQuery - the Best Scraper on the Web.
Using XQuery and MarkLogic Server, this one line is a web service client (execute this in cq as set up in the tutorial):
(: get a list of titles from the Art and Design RSS feed of the New York Times :)
(: use the MarkLogic Server http-get built-in :)
xdmp:http-get("http://www.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml")
You can then do whatever you want with the XML content that comes back:
(: make a list of the items and create links :)
<ul>{
  for $item in xdmp:http-get("http://www.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml")[2]//item
  return
    <li>
      <a href="{fn:string($item/link)}">{fn:string($item/title)}</a>
    </li>
}</ul>
and you have the links from the RSS:
- Art: Swimming With Famous Dead Sharks
- Art: Now in Moving Pictures: The Multitudes of Nikki S. Lee
- The Paris of Brassaï Goes on Sale
- Martha Holmes, 83, Pioneer in Photography, Dies
- Art Review | 'Picasso and American Art': Everybody Loves Pablo
- A Mezzanine Done Over in Bricks, Evocative and Immediate
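And since this is all just XQuery, turning the scrape back into a little web service of your own is mostly a question of where you put the file. A sketch, assuming an HTTP app server pointed at a directory holding this as (say) headlines.xqy - the file name and the "feed" request parameter are made up for the example:
(: headlines.xqy - read the feed URL from a request parameter,
   falling back to the Times Art and Design feed :)
let $feed := (xdmp:get-request-field("feed"),
              "http://www.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml")[1]
return
  <ul>{
    for $item in xdmp:http-get($feed)[2]//item
    return
      <li>
        <a href="{fn:string($item/link)}">{fn:string($item/title)}</a>
      </li>
  }</ul>
Point a browser (or another xdmp:http-get) at it and you are both client and provider of a tiny web service.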
But the real power of MarkLogic Server is what you can do with actual web pages - how about getting an image from a Google image search?
(: tidy returns in the xhtml namespace :)
declare namespace html="http://www.w3.org/1999/xhtml";
(: let's do a search for xquery images :)
let $search := "xquery"
(: we'll just call the Google image search directly :)
let $request-url :=
  fn:concat("http://images.google.com/images?q=",$search,"&amp;hl=en&amp;btnG=Search+Images")
(: make the request :)
let $request-results := xdmp:http-get($request-url)
(: take the second node of the results - the returned content - and use tidy to make it XML,
   keeping the tidied document (the second item tidy returns) :)
let $tidy-results := xdmp:tidy($request-results[2])[2]
(: use XPath to return the first image :)
(: we determined the XPath by studying $tidy-results :)
(: tidy returns elements in the xhtml namespace - so we also use that :)
let $first-image := ($tidy-results/html:html//html:table/html:tr/html:td/html:a/html:img[@width])[1]
let $img-src := fn:data($first-image/@src)
let $google-src := fn:concat("http://images.google.com",$img-src)
return
  <img src="{$google-src}"/>
And there it is - the first image from the search, pulled out of a raw results page by nothing more than the data + processing model of XQuery.
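Because it's just a FLWOR, looping the same lookup over several search terms is a small change. A sketch (the term list is made up, and xdmp:url-encode is added on the assumption that search terms may contain spaces):
declare namespace html="http://www.w3.org/1999/xhtml";
(: grab the first image for each of a handful of search terms :)
for $search in ("xquery", "marklogic", "xml")
let $request-url := fn:concat("http://images.google.com/images?q=",
                              xdmp:url-encode($search),
                              "&amp;hl=en&amp;btnG=Search+Images")
let $tidy-results := xdmp:tidy(xdmp:http-get($request-url)[2])[2]
let $first-image := ($tidy-results/html:html//html:table/html:tr/html:td/html:a/html:img[@width])[1]
let $google-src := fn:concat("http://images.google.com", fn:data($first-image/@src))
where fn:exists($first-image)
return <img src="{$google-src}" alt="{$search}"/>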
Or, how about a list of blogging terms from Wikipedia?
(: tidy returns in the xhtml namespace :)
declare namespace html="http://www.w3.org/1999/xhtml";
(: grab the blogging terms page and tidy it into XML :)
let $term-page := xdmp:tidy(
  xdmp:http-get('http://en.wikipedia.org/wiki/List_of_blogging_terms')[2]
)[2]
return
  <blogging-terms>{
    (: using the dt elements, get a list of terms :)
    for $term in $term-page//html:dt
    return
      <blog-term>
        <term>{fn:string($term/html:a)}</term>
        <link>http://en.wikipedia.org{fn:data($term/html:a/@href)}</link>
        <def>{fn:string(($term/following-sibling::html:dd)[1])}</def>
      </blog-term>
  }</blogging-terms>
Gets you a nice list of terms like this:
<blogging-terms>
<blog-term>
<term>Autocasting</term>
<link>http://en.wikipedia.org</link>
<def>Automated form of podcasting that allows bloggers and blog
readers to generate audio versions of text blogs from RSS
feeds.</def>
</blog-term>
<blog-term><term>Audioblog</term>
<link>http://en.wikipedia.org</link>
<def>A blog where the posts consist mainly of voice recordings sent
by mobile phone, sometimes with some short text message added for
metadata purposes.
(cf. podcasting)</def>
</blog-term>
</blogging-terms>
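Once the terms are clean XML, nothing stops you from keeping them around. A sketch that stores the whole list as a document in the database with xdmp:document-insert (the /scraped/... URI is just made up for the example):
declare namespace html="http://www.w3.org/1999/xhtml";
let $term-page := xdmp:tidy(
  xdmp:http-get('http://en.wikipedia.org/wiki/List_of_blogging_terms')[2]
)[2]
let $terms :=
  <blogging-terms>{
    for $term in $term-page//html:dt
    return
      <blog-term>
        <term>{fn:string($term/html:a)}</term>
        <def>{fn:string(($term/following-sibling::html:dd)[1])}</def>
      </blog-term>
  }</blogging-terms>
(: store the result under a URI of our choosing :)
return xdmp:document-insert("/scraped/blogging-terms.xml", $terms)
From there a plain XPath query against the database - fn:doc("/scraped/blogging-terms.xml")//term - gives you the list back without another trip to Wikipedia.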
The whole web is now your service-ready data source. Happy scraping!
This can also be done in XQuery using the eXist Open Source Native XML Database.
Instead of xdmp:http-get() you can use html:doc() and you do not need xdmp:tidy() as html:doc() will tidy the HTML into a suitable XML form automagically.
Posted by: Adam Retter | March 26, 2007 at 11:00 AM
The html extension module has been replaced by the httpclient extension module. You should now use httpclient:get() instead of html:doc() - tidying of HTML content is still performed automagically
Posted by: Adam Retter | June 26, 2008 at 06:03 AM