About Archive Tags RSS Feed

 

What would you have me do, Stephen?

6 November 2009 21:50

Things which should exist, but don't yet:

Transparent SQL Cache

Imagine a proxy listening on 127.0.0.1:3306, receiving SQL Queries.

Any query that was "SELECT .." could return the result from a local cache for the appropriate table. Any query of the form "UPDATE" or "INSERT" would flush all caches for the table.

Should be near-trivial: Hash the incoming query & parameters via SHA1sum to get a unique key then store/lookup results in Memcached.

Would it be useful? I think so, but of course it depends on the application and the effort involved.

"Global status"

A single site that would rebroadcast a posted (short) status message to facebook/twitter/your chat client/etc/etc.

Hard part would be receiving comments from the sites it re-served to.

Scraping statuses from facebook is hard, not sure about twitter.

This concludes my Friday wishlist.

ObFilm: Master and Commander

| 10 comments

 

Comments on this entry

icon James Vega at 15:19 on 6 November 2009

As for "Global status", that sounds like what http://ping.fm/ is designed to do.

icon Brett Parker at 15:29 on 6 November 2009

Nice idea, but, it becomes somewhat more complicated when there are SQL functions involved (which maybe updating when you're doing a select) or if the backend database has triggers on the tables causing updates on selects etc...

Cheers,
Brett.

icon Jeff Goldschrafe at 15:32 on 6 November 2009

This approach has a few problems. As with anything that's supposedly transparent, the approach trades robustness for ease-of-use, and I'm not convinced that the ease-of-use performs well enough (or at all, in some cases) to justify the additional effort.

The first is that many SQL queries are inherently non-deterministic. For example, consider the following query that selects 10 random rows from a table (and I'm aware it's not a good query for other reasons, but bear with me):

SELECT * FROM `table` ORDER BY RAND LIMIT 0,1;

This query, if cached, will return the same set of rows every time the cache is hit. In the case of a DB-agnostic transparent SQL proxy, it would have to be specifically informed that this query produces volatile results. You can either do this manually, by breaking the SQL statement's compatibility with the native RDBMS, or you can try to guess what the RDBMS is going to do with the query.

The second reason it's not really workable is that it breaks any functionality that isn't directly caused by the CREATE/UPDATE/DELETE operation. We're talking about triggers, stored procedures/user-defined functions, and other backend functionality that isn't, strictly speaking, "SQL." In some cases, like Pl/Perl or Pl/Python, this could be mitigated with custom functions that let the database server try to flush the cache. The design of it leaves a bad taste in my mouth, though.

There's stuff that occurs on the frontend that is DBMS-specific, and you might not be checking for it. For example, most of your updates in MySQL for an application that does a lot of updates might not be coming in the form of an UPDATE query, but they might be taking the form of a parameterized query where everything is streamed as data by the MySQL client libraries rather than ferried across as SQL statements. Is this the kind of backend engineering anybody actually wants to do?

The third reason is that lots of queries return a lot more data than you probably need to cache. The DBMS has a query cache and a buffer pool already, which is pretty good at doing exactly the type of caching you describe. The reason that memcached is useful is because developers have more control in caching the data that's important, rather than just throwing everything at the cache that comes out of the database. Adding another layer of cache with the exact same contents as the DB's cache and an additional layer of indirection and another point of latency won't speed things along any.

icon Steve Kemp at 15:38 on 6 November 2009

Thanks for the comments - I guess the consensus is that this approach is either too simple (failing to account for triggers and RAND) or not useful (duplicating existing DB caches).

That's a shame, but I think believe the objections seem reason enough not to attempt it.

icon Maik at 15:48 on 6 November 2009

Oracle actually has a product like the one you're suggesting. You connect to it instead of the real database server, and it keeps selected tables and views in RAM. You can choose if the table/view should be cached entirely, partially based on a WHERE clause or partially based on activity (i.e. cache n rows and age out not recently accessed ones).

That's different than the plain query caching you suggested, but actually advantageous in at least two ways: it doesn't matter what the specific query is, anything that hits the right tables can be read from the cache; and the database server can actively and incrementally push changes to cache-relevant tables to the cache as they happen, it rarely has to do a full reload.

Of course, for queries that build their result from both cached and non-cached data, the cache has to make an educated guess if building the result entirely on the server or pulling partial data from the server and combining it with cached data locally is more efficient, and it might guess wrong.

Lastly, that solution is obviously neither free nor open source, but that's not technically what you asked for, you just wanted existence. ;-)

icon Karl E. Jorgensen at 16:05 on 6 November 2009

This sounds a bit like the existing MySQL query cache to me:

http://dev.mysql.com/doc/refman/5.1/en/query-cache.html

although this only caters for identical statements. But it does handle the cache invalidation nice.

icon Dan Carley at 16:11 on 6 November 2009

How would a transparent SQL cache differ from something like MySQL's query cache? Besides being distributed in a memcached-esque way, which I'm not sure would be a massive boon.

icon lamby at 16:53 on 6 November 2009

Could probably hack something together in mysql-proxy for this pretty fast.

icon Benjamin Seidenberg at 17:48 on 6 November 2009

My understanding is twitter has a pretty nice API to do this sort of thing. I also know that there is a way to get your facebook status to update whenever you update your twitter feed, as I have plenty of friends who do this, so that's at least a start. Ought to be pretty trivial to make a pidgin plugin that can push an update to twitter, and have facebook pull it from there.

icon Michael at 03:37 on 7 November 2009

I agree with lamby. mysql-proxy could probably handle the cache thing pretty well. Granted, it would have a fairly limited problem space, but the way I traditionally do SQL it would be acceptable. With some work it could be somewhat intelligent about how it caches.