Research:Memcached Optimization

Contact

Yuval Meir

Tel Aviv University

Limor Gavish

Tel Aviv University


This page documents an abandoned research project.

Key Personnel

Yuval Meir, Tel Aviv University, yuvalme AT gmail.com, wil AT post.tau.ac.il

Limor Gavish, Tel Aviv University, lgavish AT gmail.com, limorgav AT post.tau.ac.il

Nezer Zaidenberg, Tel Aviv University, nzaidenberg AT mac.com

Project Summary

We introduce CCS, an innovative cluster cache synchronization framework. Cluster cache synchronization has grown in popularity with the introduction of "memcached". With the growing bit rate of networks, clusters of inexpensive computers often take the role of midrange and high-end servers. However, where disk I/O is concerned, a synchronized caching environment must be provided for the entire cluster, because only a shared cluster cache lets a cluster use disk I/O as efficiently as a single server with a single I/O cache. "memcached" has become the Internet standard for cache synchronization on a cluster. However, "memcached" (currently) supports only the LRU caching algorithm. The CCS engine, built on top of "memcached", allows different caching algorithms to be used in a cluster environment. In this paper we describe the CCS environment and present experimental results for cache synchronization across a cluster.

Methods

Our research tries to reduce response time and load on web servers and the databases that serve them. Therefore, we need statistical data on which links are accessed and in which order (per client). We would thus be extremely thankful if you could provide us with a month's worth of access logs of the English-language Wiktionary project. We chose to test our algorithms on the Wiktionary database because of its relatively modest volume of data, which simplifies the setup of the testing environment.

Obviously, you would not want to release identifying data, so the logs should be sanitized. The user names can be removed completely, but we would like the IP addresses to be preserved to a degree that lets us tell whether two requests came from the same IP or not.

This can be done, for example, by generating a single block of random data (200 bytes or so will be enough) and replacing each IP address in the log with the MD5 hash of the concatenation of the block and the IP address. The block should be deleted afterwards and not sent to us.
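The procedure above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`make_salt`, `pseudonymize_ip`, `sanitize_line`) are hypothetical, the log is assumed to be plain text containing IPv4 addresses in dotted-decimal form, and the actual log format may require a different matching rule.

```python
import hashlib
import os
import re

# Assumption: IPv4 addresses appear in dotted-decimal form in each log line.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def make_salt(n_bytes=200):
    """Generate the single random block; it must be discarded after the run."""
    return os.urandom(n_bytes)


def pseudonymize_ip(ip, salt):
    """Return the MD5 hex digest of the random block concatenated with the IP."""
    return hashlib.md5(salt + ip.encode("ascii")).hexdigest()


def sanitize_line(line, salt):
    """Replace every IPv4 address in a log line with its salted hash."""
    return IPV4_RE.sub(lambda m: pseudonymize_ip(m.group(0), salt), line)
```

With one block used for the whole log, two requests from the same IP map to the same token, while the original address cannot be looked up without the block, which is why it should be deleted once the sanitized log has been written.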

Alternatively, you could just replace the first two octets of each IP with zeroes, but we prefer the previous method because zeroing makes users whose last two octets are identical indistinguishable from one another.
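A minimal sketch of this alternative, again with a hypothetical helper name (`mask_first_two_octets`) and assuming dotted-decimal IPv4 input, shows the collision that makes it less attractive:

```python
def mask_first_two_octets(ip):
    """Replace the first two octets of a dotted-decimal IPv4 address with zeroes."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-decimal IPv4 address: %r" % ip)
    return ".".join(["0", "0"] + octets[2:])


# Two unrelated addresses that share their last two octets become identical.
masked_a = mask_first_two_octets("10.1.5.7")
masked_b = mask_first_two_octets("172.16.5.7")
collide = masked_a == masked_b
```

Here `masked_a` and `masked_b` are both `0.0.5.7`, so requests from the two clients can no longer be told apart in the sanitized log.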

Dissemination

We intend to publish the results as an academic research paper.

Wikimedia Policies, Ethics, and Human Subjects Protection

Since we require anonymous data (see 'Methods'), there should be no ethical issues of concern.

Benefits for the Wikimedia community

Since Wikimedia projects use Memcached to cache their databases and file systems, our optimization of the Memcached caching algorithm may reduce the latency of fetching pages from the database, meaning web pages will be displayed faster.