Announcing: django-calais v0.2
At Quotd we've been developing a lot of tools for interacting with the web's various semantic services. One of the most useful and successful services we've come across is Thomson Reuters' Open Calais. The Calais web API lets you submit data (plain text, HTML, or XML) to be analyzed for "semantic content". In other words, Calais will process your submission and spit back a list of entities (people, things, companies, products) along with what it calls events and facts (natural disasters, company mergers and acquisitions, corporate IPOs, police arrests, etc.).
This semantic metadata can get extremely complex, but it is also very powerful. The goal of the django-calais project is to help manage the complexity involved in retrieving, storing, and processing Calais results for your Django models. Essentially, it lets you submit any Django model to the Calais service for analysis, then automatically parses the results and stores them in a set of semantic models. This allows you to slice your Django objects into very narrow, semantically meaningful buckets, for example every object that mentions a particular company or every object that reports an acquisition.
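To make the "parse, store, and bucket" idea concrete, here is a minimal sketch in plain Python rather than Django. It is not the django-calais API: the `SemanticItem` class, the `parse_result` and `bucket_by_kind` helpers, and the sample payload are all hypothetical, and the field names (`_typeGroup`, `_type`) are only loosely modeled on Calais-style JSON output.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SemanticItem:
    """One entity, event, or fact attached to a source object (hypothetical)."""
    source_id: int   # primary key of the object that was submitted
    group: str       # "entities" or "relations" (events and facts)
    kind: str        # e.g. "Person", "Company", "Acquisition"
    attributes: dict = field(default_factory=dict)


def parse_result(source_id, payload):
    """Turn a Calais-style response dict into a flat list of SemanticItem."""
    items = []
    for node in payload.values():
        group = node.get("_typeGroup")
        if group not in ("entities", "relations"):
            continue  # skip document info and anything we don't model
        # Keep only the descriptive fields, not the underscore-prefixed metadata.
        attrs = {k: v for k, v in node.items() if not k.startswith("_")}
        items.append(SemanticItem(source_id, group, node.get("_type", "Unknown"), attrs))
    return items


def bucket_by_kind(items):
    """Map each semantic kind to the ids of the objects that produced it."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item.kind].append(item.source_id)
    return dict(buckets)


# Hand-written sample payload; the structure is an assumption for illustration.
SAMPLE_PAYLOAD = {
    "doc": {"info": {"docTitle": "Example"}},
    "uri/1": {"_typeGroup": "entities", "_type": "Person", "name": "Jane Doe"},
    "uri/2": {"_typeGroup": "entities", "_type": "Company", "name": "Acme Corp"},
    "uri/3": {"_typeGroup": "relations", "_type": "Acquisition", "company_acquirer": "uri/2"},
}

items = parse_result(42, SAMPLE_PAYLOAD)
buckets = bucket_by_kind(items)
```

In a real Django project the flat list would become rows in database tables, and the bucketing step would be an ordinary queryset filter on the semantic type.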
Example applications
For example, one use we've had at Quotd is Calais's automatic quote extraction. Calais returns an event and fact response for each quotation it finds in the content you provide. We automatically detect these quotation results and extract them as site content for aggregation. Similar uses are possible with other metadata. For instance, you could build an index of police arrests in your town by submitting your newspaper's police blotter pages and looking for arrest results.
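The quote-extraction step can be sketched roughly as follows. This is an illustration under assumptions, not Quotd's code: the `extract_quotes` helper and the `type`, `person`, and `quote` keys are hypothetical names for already-parsed event and fact results.

```python
def extract_quotes(relations):
    """Pick quotation results out of a list of parsed events and facts."""
    quotes = []
    for rel in relations:
        if rel.get("type") != "Quotation":
            continue  # ignore arrests, acquisitions, and other result types
        text = (rel.get("quote") or "").strip()
        if not text:
            continue  # skip quotation results with no usable text
        quotes.append({"speaker": rel.get("person") or "Unknown", "text": text})
    return quotes


# Hand-written sample results; the keys are assumptions for illustration.
SAMPLE_RELATIONS = [
    {"type": "Quotation", "person": "Jane Doe", "quote": " We expect to close the deal this year. "},
    {"type": "Arrest", "person": "John Roe", "charge": "trespassing"},
    {"type": "Quotation", "person": None, "quote": "No comment."},
]

quotes = extract_quotes(SAMPLE_RELATIONS)
```

Filtering on a different type, such as "Arrest" in this sample, would give the police blotter index in the same way.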
Other projects are also using the Calais API, most notably the django-supertagging application, which is well worth checking out. It implements the entity-extraction portion of the Calais API and uses the results as tagging data for Django models.
API Considerations
I have struggled with how to represent the semantic data results as Django models. The solution I've settled on appears to work well in the general case, but I would be very interested to hear alternative suggestions. Event and fact representation is the most difficult part, but the current design seems flexible enough to extend, which is probably essential for most applications. This is still a bit of a work in progress (ideas welcome).