I was searching for a blog topic and my son Josh suggested that I write a blog about how busy I've been in the last couple of weeks.
Excellent idea, Josh! It's XQuery, and the cool things you can do with it, that have been keeping me busy.
On top of helping people explore new ways to get the most out of their XML content with XQuery-powered projects in Digital Asset Delivery (that take advantage of the benefits of the XML meta-data model I wrote about a while ago) and working on XML-powered content production workflows (similar to the ground-breaking XQuery-powered SafariU from O'Reilly), I've also been exploring using XQuery to enhance XML with something called lazy enrichment (a term coined by Mark Logic CEO Dave Kellogg).
The idea goes something like this: if you have some content (say the complete works of Shakespeare) and it is loaded into an XML content server (say MarkLogic Server as in this tutorial), then wouldn't it be interesting to cross-cut the content with a topic like medieval weapons and be able to explore the Shakespeare texts in this new context?
This is text analytics: processing text to match topics and categories of items, giving that content new meaning. There are great engines out there that do this. They analyze text, sentence structure and patterns and can automatically create categories, lists of people, specific places and even market-specific items like company names or medical terms.
But using XQuery and some XML you can do it yourself with lazy enrichment and end up with something much better than lists of items.
The first step is to get some information on your chosen topic. While the engines that you pay lots of money for have detailed databases on specific topics, they all start with a taxonomy or controlled vocabulary of the topic. But if you know what your topic is, you can make your own list (and chances are you already have one).
For our example, let's go get a list of medieval weapons from Wikipedia:
fn:string-join(
  for $weapon in xdmp:tidy(xdmp:http-get("http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
  return fn:string($weapon),
  '","')
This uses the MarkLogic Server HTTP built-ins to get the page and then does some processing to get just the list of weapons. It's not called the best scraper on the web for nothing. While we're at it, let's make a nice comma-delimited list with quotes since we will want a text sequence for our enrichment process.
Now that we have our list of weapons, let's process the text:
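For readers without a MarkLogic Server at hand, the same extraction step can be sketched in plain Python. This is only a rough analogue, not the XQuery above: it assumes the page has already been cleaned into well-formed XHTML (the job xdmp:tidy does), it works on a small made-up sample instead of fetching Wikipedia, and the names SAMPLE_PAGE and quoted_term_list are hypothetical. Unlike the string-join above, it also adds the outer quotes.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a tidied (well-formed) list page; not real Wikipedia markup.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li><a href="/wiki/Axe">Axe</a></li>
    <li><a href="/wiki/Dagger">Dagger</a></li>
    <li><a href="/wiki/Falchion">Falchion</a></li>
  </ul>
</body></html>
"""


def quoted_term_list(xhtml):
    """Return the text of every <a> directly inside an <li> as a quoted, comma-delimited list."""
    root = ET.fromstring(xhtml)
    # ".//li/a" plays the role of //*:li/*:a (this sample has no namespaces).
    terms = ["".join(a.itertext()).strip() for a in root.findall(".//li/a")]
    if not terms:
        return ""
    return '"' + '","'.join(terms) + '"'


print(quoted_term_list(SAMPLE_PAGE))
```

The result can be pasted into a sequence of terms for the enrichment step.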
for $doc in xdmp:directory("/content/bill/")
let $weapons := ("axe", "sword", "dagger", "falchion", "etc., etc.")
return
  (: overwrite each play in place with its enriched copy :)
  xdmp:document-insert(xdmp:node-uri($doc),
    cts:highlight($doc,
      cts:or-query(
        for $test in $weapons
        return cts:word-query(fn:lower-case(fn:string($test)))
      ),
      (: $cts:text is the matched text; wrap it in new markup :)
      <weapon>{$cts:text}</weapon>
    )
  )
This takes every play I've loaded and, using MarkLogic Server's search features, finds the words in the play that match the list and creates new markup around that word to enrich the content with <weapon> elements.
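To make the wrapping step concrete outside MarkLogic, here is a minimal Python sketch of the same idea. It is an analogue under stated assumptions, not what cts:highlight does internally: it uses a plain case-insensitive whole-word regular expression instead of the server's search features, and it only handles LINE elements that contain text and no child elements. The names WEAPONS, PATTERN and enrich_line are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed term list; in practice this would be the full list gathered earlier.
WEAPONS = ["axe", "sword", "dagger", "falchion"]

# One capturing group so re.split keeps the matched words in its result.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in WEAPONS) + r")\b",
    re.IGNORECASE,
)


def enrich_line(line):
    """Wrap each weapon word in a flat <LINE> element in a <weapon> child."""
    if len(line) > 0:
        return line  # assumption: lines with existing child markup are left alone
    pieces = PATTERN.split(line.text or "")  # [plain, match, plain, match, ..., plain]
    line.text = pieces[0]
    for i in range(1, len(pieces), 2):
        weapon = ET.SubElement(line, "weapon")
        weapon.text = pieces[i]       # the matched word, original case preserved
        weapon.tail = pieces[i + 1]   # the plain text that follows it
    return line


speech = ET.fromstring(
    "<SPEECH><SPEAKER>KING LEAR</SPEAKER>"
    "<LINE>I have seen the day, with my good biting falchion</LINE>"
    "<LINE>I would have made them skip: I am old now,</LINE></SPEECH>"
)
for line in list(speech.iter("LINE")):
    enrich_line(line)
print(ET.tostring(speech, encoding="unicode"))
```

As in the XQuery version, the markup ends up inside the content itself, so later queries can use the structure around each match.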
This is lazy enrichment - taking our own knowledge of a topic, creating some rules around the matching (that can go well beyond simple search string matching) and then enriching the content in place. We're not creating any separate lists or extracting this to a database - the content now has the topic embedded in it.
What can we do with this? Well how about some really cool queries:
The first one is to actually create those topic lists - they are after all very useful. So what are all the weapons that were in the plays?
fn:string-join(cts:element-values(xs:QName("weapon")), ",")
Using MarkLogic Server's element-value indexes, this gives us a report on the weapons found.
But how about something a bit more interesting: which characters talk about weapons (so we can stay away from them)?
for $speaker in distinct-values(//SPEECH[./LINE/weapon]/SPEAKER)
return
<violent>{fn:string($speaker)}</violent>
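The two queries above (the list of weapons found, and the speakers who mention them) can also be sketched against enriched XML in plain Python. This is an illustrative analogue only: it walks a small in-memory sample rather than using MarkLogic's indexes, the second speech and its speaker are made up, and the names ENRICHED, distinct_weapons and speakers_mentioning_weapons are hypothetical.

```python
import xml.etree.ElementTree as ET

# Small enriched sample; "SPEAKER B" and its line are invented placeholders.
ENRICHED = """
<PLAY><TITLE>The Tragedy of King Lear</TITLE>
  <SPEECH><SPEAKER>KING LEAR</SPEAKER>
    <LINE>I have seen the day, with my good biting <weapon>falchion</weapon></LINE>
  </SPEECH>
  <SPEECH><SPEAKER>SPEAKER B</SPEAKER>
    <LINE>A line with no weapon in it.</LINE>
  </SPEECH>
</PLAY>
"""


def distinct_weapons(play):
    """Sorted distinct text values of all <weapon> elements."""
    return sorted({(w.text or "").strip() for w in play.iter("weapon")})


def speakers_mentioning_weapons(play):
    """Distinct speakers of any SPEECH with a LINE/weapon, in document order."""
    seen = []
    for speech in play.iter("SPEECH"):
        if speech.find("LINE/weapon") is None:
            continue
        for speaker in speech.findall("SPEAKER"):
            name = (speaker.text or "").strip()
            if name not in seen:
                seen.append(name)
    return seen


play = ET.fromstring(ENRICHED)
print(",".join(distinct_weapons(play)))
print(speakers_mentioning_weapons(play))
```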
Or maybe we'd like to know more about a specific weapon. In the list created by the first query, something called a 'falchion' showed up in my database. What the heck is a 'falchion'? Let's make a report:
for $weapon in //SPEECH/LINE/weapon[.="falchion"]
return
  <weapon>
    <name>{fn:string($weapon)}</name>
    <play>{fn:string($weapon/ancestor::PLAY/TITLE)}</play>
    <character>{fn:string($weapon/ancestor::SPEECH/SPEAKER)}</character>
    <speech>
      {$weapon/ancestor::SPEECH/LINE}
    </speech>
  </weapon>
This returns a nice report:
<weapon>
<name>falchion</name>
<play>The Tragedy of King Lear</play>
<character>KING LEAR</character>
<speech><LINE>Did I not, fellow?</LINE>
<LINE>I have seen the day, with my good biting
<weapon>falchion</weapon></LINE>
<LINE>I would have made them skip: I am old now,</LINE>
<LINE>And these same crosses spoil me. Who are you?</LINE>
<LINE>Mine eyes are not o' the best: I'll tell you straight.</LINE>
</speech>
</weapon>
Because the weapon is marked up in the actual content, I can leverage the structure, the content around it and XQuery to give me as complex a report and analysis as I can possibly want.
But I still don't really know what a falchion is - besides the fact that it's 'biting' and that Lear was fierce with it when he was young.
So for some extra credit, let's reach out to another source and, in place, augment our reading of Shakespeare to give us some new understanding. In our transformation of the plays for display (using the XQuery transformers), let's go ask Google what our weapon is:
define function weapon($x as element(), $params as node())
{
<span>
<span onclick="toggleDisplay(document.getElementById('pop-up'))">
{ passthru($x, $params) }</span>
<span id="pop-up" style="position: absolute; display:none; border-style: outset; border-width: 2; background: #ffffff; font-family: Arial; font-size: 11px">
<div style="background-color: #FFC770;">
<h3>Google Search Results</h3></div>
<div><ul class="basicMenu">
{
for $a in
(xdmp:tidy(xdmp:http-get(fn:concat("http://www.google.com/images?q=", fn:string($x)))[2])[2]//*:div//*:td)[1 to 2]
return
<p>{$a}</p>
}
{for $a in
(xdmp:tidy(xdmp:http-get(fn:concat("http://www.google.com/search?q=", fn:string($x) , "&amp;submit=Google+Search"))[2])[2]//*:div[.//*:table]/*:a)[1 to 3]
return
<p>{$a}</p>
}
</ul></div></span>
</span>
}
This gives us a nice reading of the plays that looks like this:
And we can see that a Falchion is a mean looking, huge sword. Watch out for Lear!
Thanks for the idea for the blog, Josh - it turns out that even a 'lazy' idea in XQuery is exciting stuff that certainly does keep you busy!
Matt
Great mashup of a playscript to visualize props - same principle could be used for locations etc. I'm currently doing a PhD in screenplay analytics so content analysis and visualization is the sweet spot of my interest. As a simple start I began with creating content clouds from scripts (see www.scriptcloud.com) but I have lots of other things in mind using an XML schema I have defined. Can you recommend any technology to look at (other than Mark Logic Server of course)? XQuery could obviously extract some value from parsed screenplays but is there anything else out there of interest? Any feedback appreciated. PS Thanks also for the link to the XML Shakespeare site.
Posted by: Stewart McKie | August 23, 2007 at 03:01 PM
Hi Matt. How would you disambiguate Trebuchet the Siege from Trebuchet the font?
Posted by: Shannon | August 30, 2007 at 08:41 AM