a smart xml syndication platform
Posted by moonwatcher at Saturday, November 07, 2009 | 2 comments
Labels: code , featured , xml
In 2003 i joined Conceptis Puzzles, a world leading logic puzzle company, in its struggle to translate its success in the printed world to the new online medium. This was an interesting challenge. The company was manufacturing, cataloging, maintaing and selling text documents. Essentially the entire operation could be modeled as a huge graph with XML documents on the nodes.
The data layer
SQL:2003 was just being finalized and introduced XML as a native data type to the SQL world. Instead of storing XML nodes as text, which would mean parsing and building a DOM in the application layer for every query, it would store a binary, indexed, data structure that would allow querying it with low complexity and support XPath and XQuery right inside the SQL queries.
CREATE TABLE Resource ( id int NOT NULL PRIMARY KEY, resourceClassRef AS getResourceClass(node) PERSISTED, href AS getResourceHref(id) PERSISTED UNIQUE, node XML (CoreSchemaCollection) NOT NULL, CONSTRAINT CheckResourceNode CHECK (checkResourceNode(node) = 1) ); CREATE TABLE ResourceClass ( id int NOT NULL PRIMARY KEY, href AS getResourceClassHref(id) PERSISTED UNIQUE, node XML (CoreSchemaCollection) NOT NULL );
This was exciting, I could now dissect and assemble new XML documents from the nodes in the tables by simply writing an SQL query. The database could also be made to force the XML to validate against an XSD so i don't have to worry about checking the XML for validity in the application layer. But the final touch was using XPath queries in scalar valued functions to expose parts of the data in the document, these fields were calculated only on tuple creation and update and where maintained by the database and so had no effect on performance.
CREATE function getResourceClass( @node XML ) returns int WITH schemabinding AS BEGIN return @node.value( '*[1]/genealogy[1]/@resourceClassRef', 'int' ) END CREATE function checkResourceNode(@node XML) returns int WITH schemabinding AS BEGIN return @node.exist( 'declare namespace ' + 'sresn="backwardincompatible.com/xsd/resource/node";' + 'sresn:resource' ) END CREATE function getResourceHref(@id int) returns varchar(256) WITH schemabinding AS BEGIN return 'backwardincompatible.com/XML/resource/' + Convert(varchar(32), @id); END CREATE function getResourceClassHref(@id int) returns varchar(256) WITH schemabinding AS BEGIN return 'backwardincompatible.com/XML/resource/class/' + Convert(varchar(32), @id); END
I could than map those fields to foreign key constraints that would keep the whole thing consistent on the relational level.
ALTER TABLE Resource ADD CONSTRAINT Resource_ResourceClass_FK FOREIGN KEY(resourceClassRef) REFERENCES ResourceClass(id);
I now had a data structure that amalgamated the scalable performance of a relational database, The power of the loose structure of XML and the near perfect integrity maintained by foreign key constraints and XML Schema. All the application layer had to do was to put the XML node in a tuple and if anything was wrong the database would throw an exception and roll back the transaction.
A package manager
But how would you add or modify the content of the system? Given an XML document, the namespace of the root node could be used to identify the type of the node. Once we know the type we know what table it should go into, so all we need is a package manager who holds a table containing the mapping of namespaces to their corresponding SQL table names. To manipulate a node in the system, you feed it with an XML document and an instruction: add, remove or update.
Ok, so how do we make a website?
So we had all the data in a well structured graph composed of XML nodes and we had a system that made sure its consistent, but how do we make an HTML page? This is where the real fun begins. XSLT is an XML manipulation language, it stands for eXtensible Stylesheet Language Transform. Mathematically speaking (sorry) it is a function from the XML space into the XML space, it takes an XML document and returns another XML document. It also has a very comfortable property, it is itself an XML document, So it can just become part of our graph and be stored in the database along with all the other nodes. So if we somehow have an XML document that has all the data we need to generate a page and we know the correct XSLT to use we can have an XHTML output document, which is just a kind of XML...
So again, how do we make a page? A page is essentially a URI (Uniform Resource Identifier). If when given a URI we can deduce the URI of the XML document we need and URI of the XSLT document we need we are done. Well to be honest not quite, to keep things simple we want to feed our XSLT a single XML document, but what if we need data from multiple nodes from our graph? This is where XQuery comes in, XQuery allows us to write SQL/XQuery queries that return dynamically generated XML documents composed from a whole collection of documents from our database, and even from fragments of nodes. And so we need another layer that maps our URI into a query, that query might be just fetching an actual node from the graph, but it can also fetch a document that is a composition of data we collected from various nodes of the graph.
The URI resolver
Every XML processing pipeline has a module in it responsible for resolving URIs, you need it when parsing an XML document. It might be the XSD document used to validate the XML, it might be an XInclude reference (more on that later) or any kind of other referencing made in the XML space, the resolver is the component that can find and retrieve it. What if we could override this component and teach it to map our URIs to the correct SQL query?
Regular Expressions
Regular expressions are somewhat cryptic when you look at them for the first time, they look like a cat took a stroll on a keyboard. But they have the ability to express, in a single line of simple text, a whole tale about how to extract data from text. In fact they are exactly the missing piece in the puzzle, a regular expression can identify which family a URI belongs too, or in other words which table or query do we should use to find it. It can also extract the primary key from the URI and tell us which tuple in that table has the document we want or what argument we need to provide our query to build the dynamic document.
So regular expressions allow us to map a URI to an actual XML document, real or derived. In fact we can even do something neater, given a page URL, we can use regular expressions to derive the URI for both the input XML document and the XSLT we need to process it. So we can now translate a web page URL into a set of instructions for our system to create that page.
XInclude anyone?
Ok, so i promised i will get back to this. XInclude is a pretty nifty XML dialect with one goal in mind, it allows one XML document to include another XML document (by referencing its URI) at some point in its DOM. This means we can write a document that contains other documents in our system implicitly. You must be asking yourself by now why do we need this? XInclude is what allows us to reuse components without worrying about updating them in all the places they appear. example?
Suppose we have an XML document describing a raster image. It contains a URL where the picture can be found, maybe a caption to display under it and some meta data about it like when it was taken or at what geographical location. Now we want to add a node containing an article that embeds that image, maybe that image shows up on other articles as well. All we have to do is provide an XInclude reference to it in the article, in the place where we want the image to be embedded. If we update the caption on the XML describing the image it would now be updated on all the articles where it was embedded. pretty cool ah?
That doesn't sound very efficient
By now you must be thinking, this is all very nice, theoretically, but really... Every time you want to render a page you have to resolve the URI of your XML document and preform an SQL Query to fetch it, process all the XInclude references, which might mean resolving several hundred other URIs and preform more SQL queries, fetch the XSLT, compile it and run it. All this for every page you render?
Well not really. Remember that URI Resolver we overloaded to teach it to locate our documents? It also has another function, it caches the XML documents in a hash table in memory. Because the only way to modify a document in the graph is to use the package manager, we know when a document has to refreshed. In fact we don't even need to refresh it, we simply purge it from the hash table and next time someone asks for it we store a fresh copy.
A smart XML cache
So how would we go on about implementing such a cache? Like any cache it has to use a limited amount of memory, so when we run out of space, how do we know which record to junk? A cache efficiency is measured primarily by one parameter, the hit rate, or how often when asking for a record it is found in the cache and we don't have to go to the underlaying storage to get it. Remember that roughly speaking, reading from a database means reading from a hard drive which responds in order or milliseconds. Reading from memory on the other hand is a matter of nanoseconds, thats 6 orders of magnitude difference, or a million time faster.
What governs that illusive hit rate is up to the implementation of the cache, but if our cache was to have, say, a hit rate of 99.9%, a realistic figure if your cache is good, that would mean that only one time in a thousand we would actually go and ask the database to give us our document. all other 999 times the action would be a million times faster.
So what we need is a policy that would tell us who to kick out when we run out of the allocated memory. Hardware caches, like the one in your CPU, traditionally use LRU or a Least Recently Used policy (or some modern sophisticated variation of that). The rational is that if you run out of space you get rid of the record used most rarely because it has the least influence on the hit rate. This makes allot of sense for web because pages are served to humans and humans like to do what other humans do, or in other words, if someone was looking at a page, its very likely someone else would want to look at it soon too.
CPUs have a much easier problem though, all their records have exactly the same size, which is 64 bits on most machines today. So when you need to load a new record and you have no space for it you know you have to get rid of just one, least recently used, record and you have space for your new one. But our XML document have highly variant sizes, so when we need to load a new document and we have no space for it, we can't just kick out the LRU, we might need to get rid of more than one. To be more specific, we would prefer keeping in the cache 100 1KB documents that 1 100KB document given the same popularity, because loading the 100 documents will require 100 SQL queries and loading the 1 document requires only one. We factor all those considerations and possibly some others into a scoring function that is governed by the records popularity, its size, maybe the complexity of creating it and so on. This function can look something like this:
f(record) = ((number of hits for record) * (fetching complexity)) / (record's size in bytes)
Now we set two thresholds: the maximum amount of memory we can allocate for our cache, and a backoff threshold, specified in % of the maximum threshold and tells us how much to empty when it overflows. For example, suppose we allocate 100MB and say we want a 10% backoff, so once the cache overflows we dump the least recently used documents until we drop bellow the 90MB line. Now, every we get a requests for a document we update the score to reflect it receiving a hit, and if recalculating the scoring function is computationally intensive (more on this later) we can only do it once in, say, a 100 times, this would still be statistically significant given a heavy enough load.
Wait! we are not done... How do we find the least recently used documents? We don't want to scan the whole cache every time we issue a purge command, that would have O(n) complexity, surely we can do better than that? What if we could keep the records sorted by their score function? then it would be really easy to purge, in fact it would require only O(1) complexity to preform a purge. But they thought you in school that sorting is expensive, at best it can be O(n * log(n)). Well, with some statistic magic and a random generator we can bend the rules a little and keep it almost sorted with O(1) complexity.
Suppose our backoff threshold is 10%. When we put the first record into the cache we keep a singular pointer to it, lets call it the singularity. Every time we put a new record in, we put it above the singularity with probability 0.9 and bellow with probability 0.1. Every time we hit a record we check it after updating its score it is still smaller than the one above it and still bigger than the one bellow him, if not we move it accordingly. We call this a roll. As I mentioned before, under heavy load i suffices to trigger this recalculation only once every several hits.
Another issue is that a document that used to be very popular and accumulated a very high score but is now no longer accessed will have to climb slowly all the way down to be removed. We can easily solve that inefficiency by keeping a timestamp of the last time a record was hit. When we roll a record, we also checks the timestamps on its neighbors and if they are too old we remove them. Don't forget you also have to account for the singularity when preforming a roll. If you happen to be overtaking or undertaking the singularity you have to move the pointer accordingly. All this is still O(1) complexity, so we are cool.
In production, with several hundred thousand hits a day, about 100MB memory allocated and a 10% backoff, we observed hit rates above 99.9%. An interesting side effect of this kind of caching is an absolute resilience to highly variant visit rates on a small set of pages, like some social media sites generate. If a page receives 1000 times the average visit rate for one hour because, for instance, it was featured on the digg front page, the system is not even breaking a sweat. To recap the XSLT compilation issue, the same cache is used to store a compiled version of the XSLT.
