Research:Memcached Optimization
Key Personnel
Yuval Meir, Tel Aviv University, yuvalme AT gmail.com, wil AT post.tau.ac.il
Limor Gavish, Tel Aviv University, lgavish AT gmail.com, limorgav AT post.tau.ac.il
Nezer Zaidenberg, Tel Aviv University, nzaidenberg AT mac.com
Project Summary
We introduce CCS, an innovative cluster cache synchronization framework. Cluster cache synchronization has grown in popularity since the introduction of "memcached". As network bit rates grow, clusters of inexpensive computers often take on the role of midrange and high-end servers. Where disk I/O is concerned, however, the entire cluster needs a synchronized caching environment: only with a shared cluster cache can disk I/O on a cluster be used as efficiently as on a single server with a single I/O cache. "memcached" has become the Internet standard for cache synchronization on a cluster, but it currently supports only the LRU caching algorithm. The CCS engine, built on top of "memcached", allows different caching algorithms to be used in a cluster environment. In this paper we describe the CCS environment and present experimental results for cache synchronization across a cluster.
Methods
Our research aims to reduce the response time of, and the load on, web servers and the databases that serve them. We therefore need statistical data on which links are accessed and in what order (per client). We would be extremely thankful if you could provide us with a month's worth of access logs for the English-language Wiktionary project. We chose to test our algorithms on the Wiktionary database because of its relatively modest volume of data, which simplifies setting up the testing environment.
Obviously you would not want to release identifying data, so the logs should be sanitized. User names can be removed completely, but we would like the IP addresses preserved to a degree that lets us tell whether two requests came from the same IP address.
This can be done, for example, by generating a single block of random data (200 bytes or so is enough) and replacing each IP address in the log with the MD5 hash of the concatenation of the block and the IP address. The block should be deleted afterwards and not sent to us.
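The hashing step described above could look roughly like the following Python sketch. It is only an illustration under assumptions: the function names, the IPv4-only regular expression, and the idea of rewriting one log line at a time are ours and are not tied to any actual Wikimedia log format.

```python
import hashlib
import os
import re

# Assumption: addresses appear as dotted-quad IPv4 text in the log line.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def make_secret_block(size=200):
    """Return `size` random bytes to serve as the one-time secret block."""
    return os.urandom(size)


def pseudonymize_ip(ip, secret_block):
    """Return the MD5 hex digest of the secret block concatenated with the IP."""
    return hashlib.md5(secret_block + ip.encode("ascii")).hexdigest()


def sanitize_line(line, secret_block):
    """Replace every IPv4 address found in a log line with its pseudonym."""
    return IPV4_RE.sub(
        lambda match: pseudonymize_ip(match.group(0), secret_block), line
    )
```

The same block must be used for the whole log so that equal addresses map to equal hashes; once the log has been processed, the block is simply discarded and never written out.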
Alternatively, you could simply replace the first two octets of each IP address with zeroes, but we prefer the previous method because it does not conflate users whose last two octets are identical.
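For comparison, the octet-masking alternative can be sketched in a few lines of Python. The function name is hypothetical and the sketch assumes plain IPv4 addresses; it also shows the collision that makes this option less attractive.

```python
def mask_first_two_octets(ip):
    """Replace the first two octets of a dotted-quad IPv4 address with zeroes."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % ip)
    return ".".join(["0", "0"] + octets[2:])


# Two unrelated clients that share their last two octets become
# indistinguishable after masking.
masked_a = mask_first_two_octets("10.1.2.3")      # "0.0.2.3"
masked_b = mask_first_two_octets("172.16.2.3")    # "0.0.2.3"
```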
Dissemination
We intend to publish an academic paper on this research.
Wikimedia Policies, Ethics, and Human Subjects Protection
Since we require only anonymized data (see 'Methods'), there should be no ethical issues of concern.
Benefits for the Wikimedia community
Since Wikimedia projects use Memcached to cache their databases and file systems, our optimization of the Memcached algorithm may improve response time and reduce the latency of fetching pages from the database, meaning web pages will be displayed faster.
Timeline
Gather statistical data - July 2011
Build the required setup for testing our algorithm - August 2011
Write the algorithms and integrate them into the Memcached code - October 2011
Benchmarking - November 2011
Write an academic paper - January 2012
Funding
References
Memcached official website:
Memcached FAQ:
http://code.google.com/p/memcached/wiki/FAQ
Wikipedia:
http://en.wikipedia.org/wiki/Memcached
Brad Fitzpatrick’s Blog:
http://www.linuxjournal.com/article/7451?page=0,0
Memcached server source code:
https://github.com/memcached/memcached
Example of a memcached client source code:
http://bazaar.launchpad.net/~libmemcached-developers/libmemcached/trunk/files/head:/clients/
Memcached as a message queue:
http://broddlit.wordpress.com/2008/04/09/memcached-as-simple-message-queue/