A couple of weeks ago at the Mark Logic user conference (where there were many great presentations, including a super Tim O'Reilly keynote), Jason Hunter floated the idea that we may be entering a paradigm where XML tools, like MarkLogic Server, are used as the basis for applications even if the data source isn't XML.
The idea is that the flexibility and functionality we've seen with XML content applications, like Congressional Quarterly's Legislative Impact and Harvard Business School Publishing's content logic, are fundamentally different and can be applied outside of traditional XML content environments.
Jason's example had to do with email archives. There have been many attempts to work with email archives, and most have used databases where the email is broken up into fields. However, email, like many data sources, is full of anomalies. So even though it looks simple, the resulting database schema is often very complex and still inadequate (especially if you start to deal with representing threads).
XML provides a much better model, where complexity and variation are actually expected, and, it turns out, it's fairly easy to turn email into XML.
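As a rough sketch (the element names here are just illustrative, not a prescribed schema), a message with several recipients and a threading header fits into XML without any contortions:

<email>
  <message-id>msg-1234@example.com</message-id>
  <!-- threading is just another optional element -->
  <in-reply-to>msg-9876@example.com</in-reply-to>
  <author>matt</author>
  <!-- repeat an element as often as the message needs -->
  <recipient>brian</recipient>
  <recipient>peter</recipient>
  <subject>Re: test email</subject>
  <body>Example email body</body>
</email>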
However, the key is what to do with the XML. I think you'd be hard pressed to say let's turn email into XML and then use a database or filesystem to store it, SQL to extract it, and XSLT to transform it. So it's no wonder people put up with complicated database schemas, since from there the application was at least a standard database -> app server affair.
But this all changes with XQuery and, in particular, MarkLogic Server which can process XQuery at search engine speed and has added a few helpful extensions.
With emails in XML that look like this:
<email>
  <author>matt</author>
  <subject>test email</subject>
  <!-- some more headers, etc. -->
  <body>Example email body</body>
</email>
using MarkLogic Server it's super easy to let people search the contents of emails . . . and restrict it by a certain author or other header info:
for $email in cts:search(/email,
  cts:and-query((
    cts:element-query(xs:QName("author"), "matt"),
    cts:element-query(xs:QName("body"), "email")
  )))
return
  <div>
    <b>Email from:</b> {$email/author}<br/>
    {$email/body}
  </div>
The element-query built-in lets you target the search against specific XML elements and fine-tune the results. And displaying them is a snap: we just output the HTML we want.
With a traditional database approach this isn't even possible: you either settle for less sophisticated SQL queries, or you need a search engine to index the database, plus application code to call the search engine and then pull the content out of the database. It's certainly not 5 or 6 lines of code.
But let's make something really useful: what if we don't know the author's name? What if we started with just the keyword search but wanted to give users some options to drill into the results?
Some of the very cool new features in MarkLogic Server 3.2 make this a snap.
The first step is to make use of the element-values built-in:
cts:element-values(xs:QName("author"))
Without any additional arguments, this gives us all of the unique values of the author element. This is hugely useful, especially if you are dealing with a bunch of content that was organically created or has a semi-controlled vocabulary (like an email archive). It lets you see all the actual values in the content. You can use it to correct or normalize the content, or even build a lookup list for users even though there is no 'lookup table', just the actual values in the content.
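As a minimal sketch of that lookup-list idea (assuming the author element is set up for lexicon lookups), you could build a drop-down straight from the values in the content:

(: a hypothetical author picker built from the actual values in the content :)
<select name="author">{
  for $author in cts:element-values(xs:QName("author"))
  return <option value="{$author}">{$author}</option>
}</select>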
Super cool and super flexible: if you encounter a new header value, or decide to process the content to create some new lookup values (like first and last name), you can apply the same functionality, just like that.
But, like any good Ginsu knife commercial . . . there's more: element-values also takes a search query and, new with 3.2, provides the frequency of each value within that search:
<div><b><u>Authors in your search</u></b><br/>{
  for $author in cts:element-values(xs:QName("author"), "", (),
    cts:element-query(xs:QName("body"), "email"))
  return
    <a href="refinesearch.xqy">{$author} ({cts:frequency($author)} emails)<br/></a>
}</div>
This code asks for the values of the author element, but restricts the results to the authors of emails that matched our keyword search of the body. Then it creates a neat little widget for refining the search by author, which looks like this:
Authors in your search
matt (11 emails)
brian (8 emails)
peter (4 emails)
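The href above is left bare; as a hedged sketch, refinesearch.xqy might read the chosen author and the original keywords from request parameters (hypothetical names, since the example doesn't actually pass any) and re-run the search with an extra author restriction:

(: refinesearch.xqy, sketch only; the "author" and "terms" parameter names are hypothetical :)
let $author := xdmp:get-request-field("author", "")
let $terms := xdmp:get-request-field("terms", "email")
for $email in cts:search(/email,
  cts:and-query((
    cts:element-value-query(xs:QName("author"), $author),
    cts:element-query(xs:QName("body"), $terms)
  )))
return
  <div>
    <b>Email from:</b> {$email/author}<br/>
    {$email/body}
  </div>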
Now we're really cooking. All the user does is enter a keyword search, and the application presents them with really advanced features to refine it by author. The same approach can be applied to any header value, like recipient or date, or even to time ranges.
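For example, assuming a recipient element in the markup, the code barely changes:

(: assumes a <recipient> element exists in the email markup :)
for $recipient in cts:element-values(xs:QName("recipient"), "", (),
  cts:element-query(xs:QName("body"), "email"))
return
  <a href="refinesearch.xqy">{$recipient} ({cts:frequency($recipient)} emails)<br/></a>

Date and time ranges take a bit more setup, but the shape of the application stays the same.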
In another one of the user conference talks, Alan Darnell from the University of Toronto talked about building a Digital Library with MarkLogic Server. He focused on this kind of user interaction as an ideal complement to the one-box search habits fostered by Google. While Google can't really augment your search, a library or a content application can in fact be built to guide you, since it is built with a specific purpose. For Alan, this was an opportunity to reverse a trend: instead of replacing the librarian, build a search application that actually includes a virtual librarian, waiting and ready to help users find what they need.
With XML as the model, search and discovery applications are going places we haven't yet considered. If email can benefit from the XML and XQuery model, what other data sets are out there waiting to be tapped?
So have a look in your files, peek into the rigid databases holding content and give those old data marts a good shake: a new life as XML is waiting to unlock the hidden value in that content.