November 09, 2007

Code with the XQuery Experts: 11/30, London U.K.

Right after the (American) Thanksgiving holiday and just before London Online, the world's best XQuery coders, Jason Hunter and Ryan Grimm, will be over in Royal London hosting an XQuery day.  The event details are:

Code with the XQuery Experts
Friday, November 30, 2007
8:30 am - 5:00 pm
Olympia Grand Hall
London, England

Sign up for it here.

Jason and Ryan have been using XQuery (and MarkLogic Server) from the very start and it should be an excellent event to see firsthand why XQuery is THE application language for content applications.

What's more, you can bring your own XQuery chops and win an 8GB iPhone for the Best XQuery App at the event.  There are some helpful tools in the Mark Logic code workshop to get you started and I expect full credit if something in my tutorial helps you win first place!

But you might want to 'enhance' this one:  I asked Jason and Ryan for something neat from the amazing XQuery powered email discovery application they've built called MarkMail and they sent me this very elegant FAQ generator.

The first cool thing is that it's a complete FAQ in a single complete XQuery - starting with the content and then the code to present it:

(: content as XHTML in a div - edited by MT to be a sample :)

let $content :=
<div id="content">
<a name="general"/>
<h1>GENERAL FAQ</h1>

<a name="quick"/>
<h2>Given 15 seconds, what should I know?</h2>
<ul><li>MarkMail lets you search 4,000,000+ emails across 500+ Apache mailing lists</li>
...</ul>
<a name="whatisit"/>
<h2>What is MarkMail?</h2>
<p>
MarkMail is a community-focused searchable message archive, accessible at <a
href="http://markmail.org">http://markmail.org</a>, developed and hosted by <a
href="http://www.marklogic.com">Mark Logic Corporation</a>.
...
</p>
...
<a name="techie"/>

<h1>TECHIE FAQ</h1>
<a name="whatshard"/>
<h2>What's hard about searching email?</h2>
<p>
Email doesn't work well in a relational model because there's too much free
text.  It doesn't work well in a search engine either because there's too much ad hoc structure and hierarchy ... We've found email works naturally as XML.
...
</p>
<a name="store"/>
<h2>How do you store the emails?</h2>
<p>
Each email is stored as an XML document inside MarkLogic Server.
...
</p>
...
</div>

(: from this XHTML node we can generate the table of contents including a split between regular and techie FAQ
:)

let $toc :=
    <div class="toc">
        <h1>Table of Contents</h1>
        <ul>
        {
            for $head in $content/h2[. << $content/h1[. = "TECHIE FAQ"]]
            let $name := $head/preceding::*[1][name(.) = "a"]/@name
            return <li><a href="#{ $name }">{ string($head) }</a></li>
        }
        </ul>
        <h3>Techie FAQ</h3>
        <ul>
        {
            for $head in $content/h2[. >> $content/h1[. = "TECHIE FAQ"]]
            let $name := $head/preceding::*[1][name(.) = "a"]/@name
            return <li><a href="#{ $name }">{ string($head) }</a></li>
        }
        </ul>
    </div>

(: then we put it all together :)

let $body := (
    <div id="docs">
        { $toc }
        { $content }
    </div>,
    <div style="clear: both"/>
    )

return

(: and output it :)

$body

I like that XQuery gives you a complete tool kit for content:   even if you just have simple HTML you can use the << and >> node order comparison operators to pull out all of the <h2> elements that come before the Techie FAQ <h1>, and grab the <a> element right before each <h2> using the preceding axis.
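To see the order comparison operators on their own, here is a minimal sketch in plain XQuery 1.0 (no MarkLogic extensions).  The $doc content and the MARKER heading are made up for illustration, and it uses preceding-sibling rather than the preceding axis because the anchors in this toy document are siblings of the headings:

```xquery
(: Hypothetical sample content, not taken from MarkMail :)
let $doc :=
  <div>
    <a name="one"/><h2>First question</h2>
    <h1>MARKER</h1>
    <a name="two"/><h2>Second question</h2>
  </div>
(: the node we compare document positions against :)
let $marker := $doc/h1[. = "MARKER"]
return
  <result>
    <before>{
      (: h2 elements that come earlier in document order than the marker :)
      for $h in $doc/h2[. << $marker]
      return string($h/preceding-sibling::a[1]/@name)
    }</before>
    <after>{
      (: h2 elements that come later in document order than the marker :)
      for $h in $doc/h2[. >> $marker]
      return string($h/preceding-sibling::a[1]/@name)
    }</after>
  </result>
```

This should put "one" in the <before> element and "two" in the <after> element, since << and >> compare where two nodes sit in the document rather than what they contain.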

And you're creating the output as you go - with an XHTML FAQ generated in less than 30 lines.  To see the 'live' FAQ click here.

 

I hope you can make it out to the event at Olympia and can see first hand the many cool things you can do with XQuery, the right tool for the content application job.

Matt

November 05, 2007

Agile Publishing and XQuery

This week, Mark Logic is sponsoring a breakfast seminar on agile publishing. 

The speakers are great.  I've seen Howard Ratner, the CTO of Nature Publishing, speak a couple of times and he is entertaining and informative.  I've also met with David Worlock who is both an amazing font of knowledge and an engaging speaker.

And the topic is spot on:  Looking at product development through the agile lens yields many new possibilities for how you can both build and deliver new content products.

For me, it has always been about ways to speed up and simplify the product development process.  As the Technical Director at PC World Online I was on the receiving end of hundreds of small, 'can't you just' requests.  I hated saying no - these were the good ideas that made us innovate and if I said 'this isn't in the schedule' or 'maybe next year' then we'd get nowhere.  And in 1997 at the start of web publishing we had a LOT of ground to cover!

So we did a lot of small projects, we launched things in days rather than months and we made the most of our tools, pushing XML into databases the best we could and using Tcl (!!) and Vignette as a basic framework for rapidly developing many, simultaneous projects.  It was certainly agile (if with a little 'a') and that term applied to not just the tech team, but everyone working together to create new products.

These same trends are still around and are maybe even more important as publishers today need to invent, create new products and break the mold of the traditional products (which we were very busy inventing 10 years ago!).

And the tech teams are still getting those 'can't you just' questions . .  . but now you can use XQuery instead of all those clunky database/webcms/app layer toolsets.  XQuery is *the* native programming language for XML and XML is *the* model for content.  So with XQuery you don't spend a lot of time translating your content between tables, objects and outputs.  Instead, you just get right to work on those 'can't you just' questions.

And when you use an XQuery engine like MarkLogic Server, things go even faster because you can load any content without up-front configuration and can perform any query on any part of the XML.  This turbo-charges the process by letting you get hold of some content and start writing your application right away (like we did in the first tutorial).

I often say that we wish we had XQuery back then.  Well, you do have XQuery now and what a difference it makes!

So come on out this week and explore the world of agile content products - here are the details:

The Agile Publishing Imperative:
Accelerate the Creation of Information Products

Thursday, November 8
8:00 am - 11:00 am
Four Seasons Hotel
Cost: Complimentary

Registration

Hope to see you there,

Matt

 

October 24, 2007

XQuery: The Search Language For A Multi-Platform Future

I keep a couple of google alerts for all things XQuery and I was very pleased to see this headline pop up a couple of days ago:

XQuery: The Search Language For A Multi-Platform Future
  The advent of wireless internet access has made web design a very complicated matter. Previously, all web browsers were created equal. HTML was the only language used to create web sites, and it was only possible to go online with a ...

Wow!  Along with the XML.com article XQuery, the Server Language, here is someone who really gets the power of XQuery as an application language . . .  and for search too!

And it's true, XQuery really is the search language for a multi-platform future.  Where XML is the powerful model for content from meta-data to books, XQuery is the application language that unlocks the potential of this content to build content applications to deliver content to multiple formats.

With MarkLogic's search extensions (which anticipate the additions to the standard) XQuery also becomes a search platform to power applications.  Unlike a search engine which can only point to the content, XQuery can search, manipulate and render the content all in one system.

This lets you build full applications on one platform.  For instance, as Jim Pretin (the author of the article) suggests, a dating site.

If the data for the dating site was modeled in XML that looked something like this:

<singles>
    <person>
        <name>Peter</name>
        <sex>male</sex>
        <age>32</age>
        <interests>golf, skiing, camping, gazing at the stars</interests>
    </person>
    <person>
        <name>Jane</name>
        <sex>female</sex>
        <age>27</age>
        <interests>skiing, swimming, horseback riding</interests>
     </person>
    <person>
        <name>Fred</name>
        <sex>male</sex>
        <age>32</age>
        <interests>golf, football, car racing</interests>
    </person>
</singles>

XQuery would let you perform the basic operations of searching for a date by matching conditions (a man over 30) and also let you do full text search against the content - say in the <interests> element where content is entered as free text:

for $person in input()/singles/person[./sex eq "male"][./age > 30][cts:contains(./interests, "gazing")]
return
    <date>{$person/name}</date>

*cts:contains() is a MarkLogic Server search built-in that lets you do full text search instead of the plain substring matching of contains() in the current XQuery spec.

With this single query we get a potential date that is a male, over 30 and mentioned 'gazing' in his interests.  We can then output this in any format - maybe an SMS or post to facebook or some simple XML:

<date>
    <name>Peter</name>
</date>
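If you don't have MarkLogic Server handy, roughly the same idea can be sketched in plain XQuery 1.0.  This version is an approximation under two assumptions: the data is bound inline to a $singles variable instead of coming from input(), and the standard contains() function (a simple substring match, not full text search) stands in for cts:contains():

```xquery
(: Inline copy of the sample data so the query is self-contained :)
let $singles :=
  <singles>
    <person>
      <name>Peter</name><sex>male</sex><age>32</age>
      <interests>golf, skiing, camping, gazing at the stars</interests>
    </person>
    <person>
      <name>Jane</name><sex>female</sex><age>27</age>
      <interests>skiing, swimming, horseback riding</interests>
    </person>
    <person>
      <name>Fred</name><sex>male</sex><age>32</age>
      <interests>golf, football, car racing</interests>
    </person>
  </singles>
(: three predicates: sex, age, and a substring test on the free text :)
for $person in $singles/person[sex eq "male"]
                              [xs:integer(age) > 30]
                              [contains(interests, "gazing")]
return
  <date>{ $person/name }</date>
```

Only Peter passes all three predicates.  The substring test is case sensitive and knows nothing about words or stemming, which is exactly the gap the full text extensions are there to fill.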

We can do this all in one language  - yup XQuery is powerful stuff and certainly, as the article points out, the right tool for a multi-platform future.

But there is just one little problem with the article . . . it keeps appearing over and over and over. 

At first I thought it was a mistake - maybe I was rereading the same google alert twice?  But then it just kept coming, sometimes two and three times in a single alert, sometimes with different titles (XQuery the Search Language of Tomorrow), but always there - a constant companion in my google alert.

So it looks like Jim Pretin (who runs a service called forms4free that will "GUARANTEE you a working form") is actually much more interested in spotting trendy keywords and spamming the world with content than actually promoting XQuery.
And he's quite good - check out this link from page FIVE (!) of a google search for the article.

But, all in all, this makes me pretty happy.  XQuery is now a buzzword worth spending who knows how much time replicating content around the internet to get search hits!

Just another milestone for XQuery:

    Standard (check!)
    Powerful search and query across loads of content (check!)
    Powers innovative content applications (check!)
    Internet buzzword (check!)

Yes, we have truly arrived!

October 05, 2007

XQuery at Work

I got some good feedback on my post about enrichment (thanks!) and I thought I would expand a bit on the first step of getting the list of items to power the enrichment.

I like this example because it just seems to be something that comes up over and over again.  I've been working with web technologies for 10+ years now and this was one of the first things I learned how to do . . . and it's something I am still doing day to day.  It's a truism of working with the web: at some point you will need to reach out and grab something off another website.

XQuery is great at this because no matter what the context - from SOA to complex-sounding (but pretty simple) federated search to just needing a list of weapons to enrich Shakespeare - the basics are the same: make an HTTP connection, send a request and parse the response, which is usually XML.

And there is nothing better for parsing and processing XML than XQuery.

So in this example:

fn:string-join(
for $weapon in xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon), '","')

the idea is to get the web page and make it into a sequence of strings.  In case you are wondering, every XQuery returns a sequence - even if it's a sequence of one.  A sequence can have any number of items of any type.  Pretty useful as we will see.

The first step is to get the page with xdmp:http-get() - a MarkLogic XQuery extension.  This returns a sequence of two nodes.  The first is the response node with the header info.  The second is the actual page/image/whatever that you requested:

xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")

returns

(<response xmlns="xdmp:http">
<code>200</code>
...
<headers>
<date>Wed, 03 Oct 2007 00:44:16 GMT</date>
<server>Apache</server>
. . .
</headers>
</response>
,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</head>
<body>
...
</body>
</html>)

So while it looks like XML, that second item is really text.  We need to turn it into XML with tidy:

xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])

The [2] gives us the second node of the response sequence and tidy returns:

(<status xmlns="xdmp:tidy">
<message>Info: . . .
</message>
</status>
,
<html xml:lang="en" lang="en" dir="ltr" version="-//W3C//DTD XHTML 1.1//EN" xmlns="http://www.w3.org/1999/xhtml">
<head>
...
</head>
<body>
</body>
</html>
)

Yup, another sequence . .. but now that second node is XML as produced by tidy (with all of the errors etc noted in the <status>).

From here on, we can use XQuery to process the XML into the sequence of strings we need to do matches inside the text of the Shakespeare plays.

First we need to get the list out of the page.  It turns out that within that Wikipedia page, all of the items are listed within <li> elements AND are wrapped in link anchors - <a>.

Our first step is to use XPath to get just the sequence of <a> elements - and to do this I'm using * as the namespace - it's likely the XHTML namespace, but this way I don't have to even check:

xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a

This gives us another sequence - this time of <a> elements:

(<a href="#Axes" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">1</span> <span class="toctext">Axes</span></a>
,<a href="#Daggers_and_knives" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">2</span> <span class="toctext">Daggers and knives</span></a>
,<a href="#Swords" xmlns="http://www.w3.org/1999/xhtml"><span class="tocnumber">3</span> <span class="toctext">Swords</span></a>
,...
)

We can now use the FLWOR structure of XQuery to process this sequence one <a> at a time.  FLWOR stands for For, Let, Where, Order by, Return.  For assigns each item of a sequence to a variable, let (not used in our example) can hold additional values related to that item, where allows you to filter, order by allows you to sort (the default is document order) and return is the output for each item in the sequence.
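To see all five clauses working together, here is a tiny sketch in plain XQuery 1.0.  The list of words is just a hand-typed stand-in for the scraped page, not real output from it:

```xquery
(: for: bind each item of the sequence in turn :)
for $weapon in ("Falchion", "axe", "Claymore", "dagger", "Cutlass")
(: let: hold a derived value for the current item :)
let $name := lower-case($weapon)
(: where: keep only names longer than three characters :)
where string-length($name) > 3
(: order by: sort the surviving items alphabetically :)
order by $name
(: return: the output for each item :)
return $name
```

This drops "axe" and returns the remaining four names, lower-cased, in the order claymore, cutlass, dagger, falchion.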

In our example we assign each <a> element to a variable, then use the fn:string() function to get the string value.  Running this over the entire sequence of <a> elements creates a sequence of strings:

for $weapon in
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon)

This returns:

("1 Axes","2 Daggers and knives", "3 Swords", .... "Broadsword", "Claymore", "Cutlass", "Falchion", ...)

We're getting the header values, but that's OK - they just won't match anything.

And we now have a sequence of strings.

Except that I wanted to save this as a file so I could run it over and over again while I got the enrichment right.

So while I've written it out as a sequence of strings (and this is the accurate state within the server) if I output this, I actually get this:

1 Axes
2 Daggers and knives
3 Swords
...
Broadsword
Claymore
Cutlass
Falchion
...

This is also correct . . . when output, the strings are represented as, well, strings.  And certainly not as the  comma delimited list of quoted items that is really a string sequence constructor . . . and what I need.  So I need to make a single string that *is* comma delimited and has values in quotes so I can then stick that inside my code and it will become a sequence of strings.

To do this I just use my favorite XQuery function, fn:string-join().  This takes ANY sequence and makes it into a nice single string with whatever delimiter you select between each element:

fn:string-join(
for $weapon in
xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon)
, '","')

The delimiter here is ",".  Like many content friendly languages, XQuery lets you use single and double quotes so you can return these special characters.

And unlike many other languages, fn:string-join correctly puts the delimiters *between* every item in the list . .. and doesn't also put it on the last item which you then have to correct when you write this function yourself in java, perl . . . just about anything else.

The result is *almost* a single string that represents a sequence of strings that can be put into a file or another XQuery:

1 Axes","2 Daggers and knives","3 Swords","4 Blunt weapons" ...

I didn't do the extra step of adding the first '("' and the last '")' . . . we can just concat these on:

concat(
'("',
fn:string-join(
for $weapon
...
,'")'
)
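Spelled out in full, the wrapping step might look like the sketch below.  It is plain XQuery 1.0, and it assumes a short hand-typed $weapons list in place of the scraped Wikipedia page so that it runs anywhere:

```xquery
(: stand-in for the strings scraped from the page :)
let $weapons := ("axe", "sword", "dagger", "falchion")
return
  concat(
    '("',                             (: opening paren and first quote :)
    string-join($weapons, '","'),     (: quote-comma-quote between items :)
    '")'                              (: last quote and closing paren :)
  )
```

The result is the single string ("axe","sword","dagger","falchion"), which can be pasted straight into another query as a sequence constructor.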

And now we have our string for enrichment:

for $doc in xdmp:directory("/content/bill/")
let $weapons := ("axe", "sword","dagger", "falchion" , "etc., etc.")
return
    ....

To read the rest of the story about what to do with a list of weapons and the works of Shakespeare, Click here.

For my part, I use this pattern and process all the time.  Just last week someone suggested that it would be really cool to use Flickr to augment information on cameras by showing actual photos taken with that camera.

What a good idea!  And, once you know all these tricks in XQuery, it's a one-liner, thanks to flickr's nicely structured pages (where sample images are in a "DetailPic" class):

xdmp:tidy(xdmp:http-get("http://www.flickr.com/search/?q=photo&cm=nikon%2Fd200")[2])[2]//*:td[@class="DetailPic"]//*:img

I'm asking for my camera (Nikon D200) and a generic term 'photo' and I get these nice pictures:

[Images: three sample photos returned by the query]

All ready to be inserted into a rich, XQuery powered, content application.

Matt

August 22, 2007

XQuery and Lazy Enrichment: Keeping me Busy

I was searching for a blog topic and my son Josh suggested that I write a blog about how busy I've been in the last couple of weeks.

Excellent idea Josh!  Because it's XQuery and the cool things you can do with it that have been keeping me busy.

On top of helping people explore new ways to get the most out of their XML content with XQuery powered projects in Digital Asset Delivery (that take advantage of the benefits of the XML meta-data model I wrote about a while ago) and working on XML powered content production workflows (similar to the ground breaking XQuery powered SafariU from O'Reilly), I've also been exploring using XQuery to enhance XML with something called lazy enrichment (a term coined by Mark Logic CEO Dave Kellogg).

The idea goes something like this:  if you have some content (say the complete works of Shakespeare) and it is loaded into an XML content server (say MarkLogic Server as in this tutorial), then wouldn't it be interesting to cross cut the content with a topic like medieval weapons and be able to explore the Shakespeare texts in this new context?

This is text analytics: process text to match topics and categories of items and give that content new meaning.  There are great engines out there that do this.  They analyze text, sentence structure and patterns and can automatically create categories, lists of people, specific places and even market specific items like company names or medical terms.

But using XQuery and some XML you can do it yourself with lazy enrichment and end up with something much better than lists of items.

The first step is to get some information on your chosen topic.  While the engines that you pay lots of money for have detailed databases on specific topics, they all start with a taxonomy or controlled vocabulary of the topic.  But if you know what your topic is, you can make your own list (and chances are you already have one).

For our example, let's go get a list of medieval weapons from Wikipedia:

fn:string-join(
for $weapon in xdmp:tidy(xdmp:http-get( "http://en.wikipedia.org/wiki/List_of_medieval_weapons")[2])[2]//*:li/*:a
return
fn:string($weapon), '","')

This uses the MarkLogic Server HTTP built-ins to get the page and then does some processing to get just the list of weapons.  It's not called the best scraper on the web for nothing.  While we're at it, let's make a nice comma delimited list with quotes since we will want a text sequence for our enrichment process.

Now that we have our list of weapons, let's process the text:

for $doc in xdmp:directory("/content/bill/")
let $weapons := ("axe", "sword","dagger", "falchion" , "etc., etc.")
return
    xdmp:document-insert(xdmp:node-uri($doc),
        cts:highlight($doc,
            cts:or-query(
            for $test in $weapons
            return
            cts:word-query(fn:lower-case(fn:string($test)))
            )
            ,<weapon>{$cts:text}</weapon>
        )
    )

This takes every play I've loaded and, using MarkLogic Server's search features, finds the words in the play that match the list and creates new markup around that word to enrich the content with <weapon> elements.

This is lazy enrichment - taking our own knowledge of a topic, creating some rules around the matching (that can go well beyond simple search string matching) and then enriching the content in place.  We're not creating any separate lists or extracting this to a database - the content now has the topic embedded in it.

What can we do with this?  Well how about some really cool queries:

The first one is to actually create those topic lists - they are after all very useful.  So what are all the weapons that were in the plays?

fn:string-join(cts:element-values(xs:QName("weapon")), ",")

Using MarkLogic Server's element-value indexes, this gives us a report on the weapons found.

But how about something a bit more interesting:  which characters talk about weapons (so we can stay away from them)?

for $speaker in distinct-values(//SPEECH[./LINE/weapon]/SPEAKER)
return
<violent>{fn:string($speaker)}</violent>

Or maybe we'd like to know more about a specific weapon - in the list created by the first query, something called a 'falchion' showed up in my database.  What the heck is a 'falchion'?  Let's make a report:

for $weapon in //SPEECH/LINE/weapon[.="falchion"]
return
<weapon>
    <name>{fn:string($weapon)}</name>
    <play>{fn:string($weapon/ancestor::PLAY/TITLE)}</play>
    <character>{fn:string($weapon/ancestor::SPEECH/SPEAKER)}</character>
    <speech>
        {$weapon/ancestor::SPEECH/LINE}
    </speech>
</weapon>

This returns us a nice report:

<weapon>
    <name>falchion</name>
    <play>The Tragedy of King Lear</play>
    <character>KING LEAR</character>
    <speech><LINE>Did I not, fellow?</LINE>
                <LINE>I have seen the day, with my good biting          
                <weapon>falchion</weapon></LINE>
                <LINE>I would have made them skip: I am old now,</LINE>
                <LINE>And these same crosses spoil me. Who are you?</LINE>
                <LINE>Mine eyes are not o' the best: I'll tell you straight.</LINE>
    </speech>
</weapon>

Because the weapon is marked up in the actual content, I can leverage the structure, the content around it and XQuery to give me as complex a report and analysis as I can possibly want.

But I still don't really know what a falchion is - besides the fact that it's 'biting' and that Lear was fierce with it when he was young.

So for some extra credit, let's reach out to another source and, in place, augment our reading of Shakespeare to give us some new understanding.  In our transformation of the plays for display (using the XQuery transformers), let's go ask google what our weapon is:

define function weapon($x as element(), $params as node())
{
      <span>
             <span onclick="toggleDisplay(document.getElementById('pop-up'))">
             { passthru($x, $params) }</span>
             <span id="pop-up" style="position: absolute; display:none; border-style: outset; border-width: 2; background: #ffffff; font-family: Arial; font-size: 11px">
             <div style="background-color: #FFC770;">
              <h3>Google Search Results</h3></div>
             <div><ul class="basicMenu">
             {
             for $a in
             (xdmp:tidy(xdmp:http-get(fn:concat("http://www.google.com/images?q=", fn:string($x)))[2])[2]//*:div//*:td)[1 to 2]
             return
             <p>{$a}</p>
             }
             {for $a in
             (xdmp:tidy(xdmp:http-get(fn:concat("http://www.google.com/search?q=", fn:string($x) , "&submit=Google+Search"))[2])[2]//*:div[.//*:table]/*:a)[1 to 3]
             return
             <p>{$a}</p>
             }            
             </ul></div></span>
      </span>
}

This gives us a nice reading of the plays that looks like this:

[Image: Falchion]


And we can see that a Falchion is a mean looking, huge sword.  Watch out for Lear!

Lazy enrichment turns out to be pretty powerful stuff!  You can annotate and augment texts with your own concepts and build rich displays to get new meaning out of even tried and true Shakespeare.

Thanks for the idea for the blog Josh - it turns out that even a 'lazy' idea in XQuery is exciting stuff that certainly does keep you busy!

July 04, 2007

Celebrate (XML) Independence

A couple of weeks ago Kurt Cagle posted XQuery, The Server Language on XML.com.  I like this article a whole lot because it:

  • Explores how XQuery is more than a query language
  • Shows how XQuery is actually THE server-side scripting language to create HTML
  • and contains this very nice example of how things used to be:
$buf ="<html><head><title>".$myTitle;
$buf .= "</title><body>";
$buf .= "<h1>This is a test.</h1>";
$buf .= "<p>If this were an actual emergency, we'd be out of here by now.";
echo $buf;

Yup - back before XQuery, in the days when your dad had to ride his bike uphill both ways to school and people didn't even have answering machines (you had to just call back later) . . . this is how you had to make HTML.

Creating strings to represent elements is fraught with danger - while the above works, it's hardly valid. But it was the only simple choice (the other options, as Kurt points out, were Java pipelines with multiple moving parts).

Aren't we glad we have XQuery?

This sort of thing is now done in an entirely XML-centric environment where you just create elements instead of strings you hope will work out:

let $mytitle := "title from input"
return
<html><head><title>{$mytitle}</title></head>
<body>
<h1>This is a test.</h1>
<p>If this were an actual emergency, we'd be out of here by now.</p>
</body></html>

This is a very liberating moment.  Here is a scripting language, built to create XML - the language of the web.  There is no impedance mismatch between text, objects, and elements . . . it's all about the tags.

So celebrate the XML independence by getting a good XQuery engine like MarkLogic Server and bring a little XQuery into your content applications. 

But be careful, it's pretty addictive . . . luckily, these days you can just let the answering machine pick up.

June 19, 2007

XML and XQuery as the Model

A couple of weeks ago at the Mark Logic user conference (where there were many great presentations including a super Tim O'Reilly keynote) Jason Hunter floated the idea that we may be entering into a paradigm where XML tools, like MarkLogic Server, are used as the basis for applications even if the data source isn't XML.

The idea is that the flexibility and functionality we've seen with XML content applications like Congressional Quarterly's Legislative Impact and Harvard Business School Publishing's content logic is fundamentally different and can be applied outside of traditional XML content environments.

Jason's example had to do with email archives.  There have been many attempts to work with email archives and most have used databases where the email is broken up into fields.  However email, like many data sources, is full of anomalies.  So even though it looks simple, the resulting database schema is often very complex and inadequate (especially if you start to deal with representing threads).

XML provides a much better model where complexity and variations are actually expected and, it turns out, it's fairly easy to turn email into XML.

However the key is what to do with the XML.  I think you'd be hard pressed to say let's turn email into XML and then use a database or filesystem to store it, SQL to extract it and XSLT to transform it.  So no wonder people put up with complicated database schemas, since from there the application was at least a standard database -> app server affair.

But this all changes with XQuery and, in particular, MarkLogic Server which can process XQuery at search engine speed and has added a few helpful extensions.

With emails in XML that look like this

<email>
<author>matt</author>
<subject>test email</subject>
<!-- some more headers, etc. -->
<body>Example email body</body>
</email>

using MarkLogic Server it's super easy to let people search the contents of emails . . . and restrict it by a certain author or other header info:

for $email in cts:search(/email, cts:and-query((cts:element-query(xs:QName("author"), "matt"), cts:element-query(xs:QName("body"), "email"))))
return
    <div>
        <b>Email from:</b> {$email/author}<br/>
        {$email/body}
    </div>

The element-query search built-in lets you target your search against specific XML elements and fine tune it.  And displaying it is a snap:  we just output the HTML we want.

With a traditional database approach this isn't even possible: you either do less complicated SQL queries or you need a search engine to index the database and application code to call the search engine and then get the content out of the database.  It's certainly not 5 or 6 lines of code.

But let's make something really useful: what if we don't know the author's name?  What if we started with just the keyword search but wanted to give the users some options to drill into the result?

Some of the very cool new features in MarkLogic Server 3.2 make this a snap. 

The first step is to make use of the element values built-in:

cts:element-values(xs:QName("author"))

Without any other arguments, this gives us all of the unique values of the author element.  This is hugely useful - especially if you are dealing with a bunch of content that was organically created or has a semi controlled vocabulary (like an email archive).  This will let you see all the actual values in the content.  You can use it to correct or normalize the content or even make a lookup list for users even though there is no 'lookup table', just the actual values in the content.

Super cool and super flexible:  if you encounter a new header value - or decide to process the content to create some new lookup values (like first and last name) you can apply the same functionality, just like that.

But, like any good ginsu knife commercial . . . there's more:  element-values also takes a search query and, new with 3.2, provides the frequency of each value within that search:

<div><b><u>Authors in your search</u></b><br/>{
for $author in cts:element-values(xs:QName("author"),"",(), cts:element-query(xs:QName("body"), "email") )
return
    <a href="refinesearch.xqy">{$author} ({cts:frequency($author)} emails) <br/></a>
}</div>

This code asks for the values of the author field, but restricts the results to the authors of emails that matched our keyword search of the body.  Then it creates a neat little widget to refine the search by the authors that looks like this:

Authors in your search
matt (11 emails)
brian (8 emails)
peter (4 emails)

Now we're really cooking.  All the user does is enter a keyword search and the application presents them with really advanced features to refine it by author.  The same approach can be applied to any header value, such as recipient, or even to dates and time ranges.
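Here is one way the page behind those refine links might look.  It is a sketch under assumptions, not the real thing: it supposes each link passes the chosen author as a request field named author (for example refinesearch.xqy?author=matt), which the widget above does not yet do, and it keeps the keyword hard-coded as in the earlier examples.

```xquery
(: refinesearch.xqy - hypothetical sketch of the refine step.       :)
(: Assumes the link supplies an "author" request field; the field   :)
(: name and the "matt" fallback are illustrative only.              :)
let $author := xdmp:get-request-field("author", "matt")
let $query :=
    cts:and-query((
        (: match the exact author value the user clicked on :)
        cts:element-value-query(xs:QName("author"), $author),
        (: keep the original keyword restriction on the body :)
        cts:element-query(xs:QName("body"), cts:word-query("email"))))
for $email in cts:search(/email, $query)
return
    <div>
        <b>Email from:</b> {$email/author}<br/>
        {$email/body}
    </div>
```

Using a value query for the author keeps the refinement exact, so clicking matt returns only the emails counted next to his name.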

In another one of the user conference talks, Alan Darnell from the University of Toronto talked about building a Digital Library with MarkLogic Server.  He focused on this kind of user interaction as being an ideal complement to the one-box search habits fostered by Google.  While Google can't really augment your search, a library or a content application can in fact be built to guide you, since it is built with a specific purpose.  For Alan, this was an opportunity to reverse a trend: instead of replacing the librarian, make a search application that actually includes a virtual librarian, waiting and ready to help the user find what they need.

Starting with XML as the model, search and discovery applications are starting to go places we haven't yet considered.  If email can benefit from the XML and XQuery model, what other data sets are out there waiting to be tapped?

So have a look in your files, peek into the rigid databases holding content and give those old data marts a good shake:  a new life as XML is waiting to unlock the hidden value in that content.



 

May 14, 2007

Teaching XQuery

This morning I'm in San Francisco teaching a class on XQuery before the Mark Logic User Conference swings into session tomorrow.

As part of the class we'll be doing a lab using the Shakespeare XML and (the real reason for this post) the sample content is right here:

Download bill.zip

We're going to try, in about an hour, to download, install and configure MarkLogic Server and then load this content and do some queries with CQ.  Very much like the first part of the tutorial.

Yup - that's a live session with ~30 people, all of whom will be executing XQuery after about an hour.  XQuery makes this possible - it really just works.

Wish us luck!

Matt

May 10, 2007

Mark Logic User Conference Next Week plus MarkLogic Server 3.2 Released!

Just a quick reminder about the Mark Logic user conference next week in San Francisco.

As previously reported, there will be no naked people, but there will be lots of XQuery enthusiasts and talks on some of the most innovative information products from Congressional Quarterly, Harvard Business School Publishing, McGraw-Hill Education and more.

Tim O'Reilly will be giving a keynote on Wednesday and Mark Logic CEO Dave Kellogg will kick things off on Tuesday.

It's not too late to sign up and it's still FREE!

To make things even better, on the eve of this great event Mark Logic has released MarkLogic Server 3.2!

As Mark Logic Products VP Ian Small says, this release is a real diamond that really puts the power of XQuery in your hands.

You can work with more types of content with extensive language support (down to the node level - cool!), new conversions including Office 2007 (yup - now you can actually do something with Office XML) and more encodings (so you can now scrape non-UTF-8 web pages).

It also adds powerful search capabilities for efficient complex searches, makes the already blindingly fast performance even faster and adds content analytics to power user navigation and information displays. Plus there are lots of goodies for XQuery developers, including debugger support.

I'll be posting from the user conference so more on all of this to come.

Hope to see you in San Francisco!

Matt

P.S. Maybe next year we'll hire Spencer Tunick.

May 07, 2007

Go Native! CMS(s), XML and XQuery

This Thursday, May 10th 2007, I'll join Lisa Bos from Really Strategies to talk about the benefits of a native XML CMS and have a look at RSuite, an XML Content Management System powered by XQuery and MarkLogic Server.

This is a really interesting topic:  at the 2006 London Online show 51 of 60 software vendors selected the 'content management' category and of these 31 were offering a Content Management System.  In that crowded field, things like automatic accessibility, ease of use and eCommerce built-ins stood out.  But full support for XML and the powerful sub-document access and control it brings was almost totally absent.

For people who work with content, investing in XML is key to their business.  But when it comes to content management systems, they often have to shoehorn, change or otherwise mangle their XML to make it fit.  A couple of weeks ago, Lisa discussed this on the Really Strategies blog.  Dave Kellogg also has some thoughts on this.

So join Lisa and me on Thursday to hear more about XML and content management.

Sign up here and hope to see you there.

MT