I got some good feedback on my post about enrichment (thanks!) and I thought I would expand a bit on the first step of getting the list of items to power the enrichment.
I like this example because it just seems to be something that comes up over and over again. I've been working with web technologies for 10+ years now and this was one of the first things I learned how to do . . . and it's something I am still doing day to day. It's a truism of working with the web: at some point you will need to reach out and grab something off another website.
XQuery is great at this because no matter what the context - from SOA to complex-sounding (but pretty simple) federated search to just needing to get a list of weapons to enrich Shakespeare - the basics are the same: make an HTTP connection, send a request, and parse the (usually XML) response.
And there is nothing better for parsing and processing XML than XQuery.
So in this example:
fn:string-join(
for $weapon in xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon), '","')
the idea is to get the web page and make it into a sequence of strings. In case you are wondering, every XQuery returns a sequence - even if it's a sequence of one. A sequence can have any number of items of any type. Pretty useful, as we will see.
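To make the point about sequences concrete, here is a minimal sketch that uses only standard XQuery functions. The variable names and values are made up for illustration and are not part of the weapons example.

```xquery
(: A sequence can mix item types: an integer, a string, and an element. :)
let $mixed := (1, "two", <three/>)
(: A single value is just a sequence of length one. :)
let $single := "Claymore"
return (fn:count($mixed), fn:count($single))
```

This should return the sequence (3, 1).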
The first step is to get the page with xdmp:http-get() - a MarkLogic XQuery extension. This returns a sequence of two nodes. The first is the response node with the header info. The second is the actual page/image/whatever that you requested:
xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")
returns
(<response xmlns="xdmp:http">
<code>200</code>
...
<headers>
<date>Wed, 03 Oct 2007 00:44:16 GMT</date>
<server>Apache</server>
. . .
</headers>
</response>
,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
<body>
...
</body>
</html>)
So while it looks like XML, that second item is really text. We need to turn it into XML with tidy:
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])
The [2] gives us the second node of the response sequence and tidy returns:
(<status xmlns="xdmp:tidy">
<message>Info: . . .
</message>
</status>
,
<html xml:lang="en" lang="en" dir="ltr" version="-//W3C//DTD XHTML 1.1//EN" xmlns="http://www.w3.org/1999/xhtml">
<head>
...
</head>
<body>
</body>
</html>
)
Yup, another sequence . . . but now that second node is XML as produced by tidy (with all of the errors, etc. noted in the <status>).
From here on, we can use XQuery to process the XML into the sequence of strings we need to do matches inside the text of the Shakespeare plays.
First we need to get the list out of the page. It turns out that within that Wikipedia page, all of the items are listed within <li> elements and are wrapped in link anchors - <a>.
Our first step is to use XPath to get just the sequence of <a> elements - and to do this I'm using * as the namespace - it's likely the XHTML namespace, but this way I don't have to even check:
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
This gives us another sequence - this time of <a> elements:
(<a href="#Axes" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">1</span> <span class="toctext">Axes</span></a>
,<a href="#Daggers_and_knives" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">2</span> <span class="toctext">Daggers and knives</span></a>
,<a href="#Swords" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">3</span> <span class="toctext">Swords</span></a>
,...
)
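As an aside on the *: wildcard, here is a small sketch of the difference between the wildcard, a declared prefix, and no prefix at all. It runs against a tiny hand-written fragment rather than the live page; the xhtml prefix, the $page variable and the /wiki/Claymore link are assumptions for illustration only.

```xquery
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: A stand-in for the tidied page, in the XHTML namespace. :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body><ul><li><a href="/wiki/Claymore">Claymore</a></li></ul></body>
  </html>
return (
  fn:count($page//*:li/*:a),         (: wildcard: matches any namespace :)
  fn:count($page//xhtml:li/xhtml:a), (: declared prefix: matches XHTML only :)
  fn:count($page//li/a)              (: no prefix: looks in no namespace :)
)
```

This should return (1, 1, 0): the unprefixed path finds nothing because the elements live in the XHTML namespace, which is exactly the check the wildcard lets me skip.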
We can now use the FLWOR structure of XQuery to process this sequence one <a> at a time. FLWOR stands for For, Let, Where, Order by, Return. For assigns each item of a sequence to a variable, let (not used in our example) can hold additional values related to that item, where allows you to filter, order by allows you to sort (the default is the order of the input sequence, which here is document order), and return is the output for each item in the sequence.
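To show all five clauses working together, here is a minimal sketch over a hard-coded sequence of strings instead of the <a> elements. The sample values and the $length variable are illustrative assumptions.

```xquery
for $weapon in ("Cutlass", "1 Axes", "Broadsword", "Claymore")
(: let holds a value derived from the current item :)
let $length := fn:string-length($weapon)
(: where drops the numbered header entry :)
where fn:not(fn:matches($weapon, "^[0-9]"))
(: order by sorts what is left alphabetically :)
order by $weapon
return fn:concat($weapon, " (", $length, ")")
```

This should return ("Broadsword (10)", "Claymore (8)", "Cutlass (7)").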
In our example we assign each <a> element to a variable, then use the fn:string() function to get the string value. Running this over the entire sequence of <a> elements creates a sequence of strings:
for $weapon in
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon)
This returns:
("1 Axes","2 Daggers and knives", "3 Swords", .... "Broadsword", "Claymore", "Cutlass", "Falchion", ...)
We're getting the header values, but that's OK - they just won't match anything.
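If the header values ever did get in the way, one possible approach (a sketch, not what I actually ran) is to skip anchors whose href starts with "#", since the table of contents links above all look like that. The fragment below is a hand-written stand-in for the tidied page, and the /wiki/Claymore link is a made-up example.

```xquery
let $page :=
  <ul xmlns="http://www.w3.org/1999/xhtml">
    <li><a href="#Axes">1 Axes</a></li>
    <li><a href="/wiki/Claymore">Claymore</a></li>
  </ul>
(: keep only anchors that do not point at an in-page header :)
for $weapon in $page//*:li/*:a[fn:not(fn:starts-with(@href, "#"))]
return fn:string($weapon)
```

On this fragment only "Claymore" should come back.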
And we now have a sequence of strings.
Except that I wanted to save this as a file so I could run it over and over again while I got the enrichment right.
So while I've written it out as a sequence of strings (and this is the accurate state within the server), if I output this, I actually get this:
1 Axes
2 Daggers and knives
3 Swords
...
Broadsword
Claymore
Cutlass
Falchion
...
This is also correct . . . when output, the strings are represented as, well, strings. And certainly not as the comma delimited list of quoted items that is really a string sequence constructor . . . and what I need. So I need to make a single string that *is* comma delimited and has values in quotes so I can then stick that inside my code and it will become a sequence of strings.
To do this I just use my favorite XQuery function, fn:string-join(). This takes any sequence of strings and makes it into a nice single string with whatever delimiter you select between each item:
fn:string-join(
for $weapon in
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon)
, '","')
The delimiter here is ",". Like many content-friendly languages, XQuery lets you delimit a string with either single or double quotes, so you can put the other kind of quote inside it without escaping.
And unlike many other languages, fn:string-join correctly puts the delimiter *between* the items in the list . . . and doesn't also put it after the last item, which you then have to correct when you write this function yourself in Java, Perl . . . just about anything else.
The result is *almost* a single string that represents a sequence of strings that can be put into a file or another XQuery:
1 Axes","2 Daggers and knives","3 Swords","4 Blunt weapons" ...
I didn't do the extra step of adding the first '("' and the last '")' . . . we can just concat these on:
concat(
'("',
fn:string-join(
for $weapon
...
,'")'
)
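Here is the same wrap-and-join idea as a complete, self-contained sketch, with a short hard-coded list standing in for the scraped weapons (the $weapons values are just sample data):

```xquery
let $weapons := ("Broadsword", "Claymore", "Cutlass")
(: join with "," between items, then add the opening (" and closing ") :)
return fn:concat('("', fn:string-join($weapons, '","'), '")')
```

This should produce the single string ("Broadsword","Claymore","Cutlass"), ready to paste into another query as a sequence constructor.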
And now we have our string for enrichment:
for $doc in xdmp:directory("/content/bill/")
let $weapons := ("axe", "sword","dagger", "falchion" , "etc., etc.")
return
....
To read the rest of the story about what to do with a list of weapons and the works of Shakespeare, Click here.
For my part, I use this pattern and process all the time. Just last week someone suggested that it would be really cool to use Flickr to augment information on cameras by showing actual photos taken with that camera.
What a good idea! And, once you know all these tricks in XQuery, it's a one-liner, thanks to Flickr's nicely structured pages (where sample images are in a "DetailPic" class):
xdmp:tidy(xdmp:http-get("http://www.flickr.com/search/?q=photo&cm=nikon%2Fd200")[2])[2]//*:td[@class="DetailPic"]//*:img
I'm asking for my camera (Nikon D200) and a generic term 'photo' and I get these nice pictures:
All ready to be inserted into a rich, XQuery powered, content application.
Matt
I added the example of scraping the Wiki page to the Wikibook on XQuery which I and a couple of chums are working on.
http://en.wikibooks.org/wiki/XQuery/Wiki_weapons_page
The examples are executable on my eXist server. Perhaps it would be interesting to include some alternative executing scripts running on a MarkLogic server?
Posted by: Chris Wallace | October 30, 2007 at 08:01 PM