Pressflow

Making Drupal and Pressflow more mundane

Drupal and Pressflow have too much magic in them, and not the good kind. On the recent Facebook webcast introducing HipHop PHP, their PHP-to-C++ converter, they broke down PHP language features into two categories: magic and mundane. The distinction is how well each capability of PHP, a dynamic language, translates to a static language like C++. “Mundane” features translate well to C++ and get a big performance boost in HipHop PHP. “Magic” features are either unsupported, like eval(), or run about as fast as today’s PHP+APC, like call_user_func_array().

Anticipage: scalable pagination, especially for ACLs

Pagination is one of the hardest problems for web applications supporting access-control lists (ACLs). Drupal and Pressflow support ACLs through the node access system.

Problems with traditional pagination

  • Because pagination uses row offsets into the results, browsing listings where newly published items get added to the beginning of the results creates “page drift.” Page drift is where a user already browsing through paginated results sees, for example, items E, D, and C on page one, waits awhile, clicks to the next page, and sees items C, B, and A. Going back to page one again shows F (newly published), E, and D. Item C “drifted” to page two while the user was reading page one. If new items are published frequently enough, pagination can become unusable due to this drifting effect.
  • Even if content and ordering are fully indexed, jumping n rows into the results remains inefficient; it scales linearly with depth into pagination.
  • Paginating sets where the content and ordering are not fully indexed is even worse, often to the point of being unusable.
  • The design is optimized around visiting arbitrary page offsets, which does not reflect user needs. Users only need to make relative jumps in pagination of up to 10 pages (or so) in either direction or to start from the end of the results. (If users are navigating results by hopping to arbitrary pages to drill down to what they need, there are other flaws in the system.)

Intelligent memcached and APC interaction across a cluster

Anyone experienced with high-performance, scalable PHP development is familiar with APC and memcached. But used alone, they each have serious limitations:

APC

  • Advantages
    • Low latency
    • No need to serialize/unserialize items
    • Scales perfectly with more web servers
  • Disadvantages
    • No enforced consistency across multiple web servers
    • Cache is not shared; each web server must generate each item

memcached

  • Advantages
    • Consistent across multiple web servers
    • Cache is shared across all web servers; items only need to be generated once
  • Disadvantages
    • High latency
    • Requires serializing/unserializing items
    • Easily shards data across multiple web servers, but is still a big, shared cache

Combining the two

Traditionally, application developers simply think about consistency needs. If consistency is unnecessary (or the scope of the application is one web server), APC is great. Otherwise, memcached is the choice. There is, however, a third, hybrid option: use memcached as a coordination system for invalidation with APC as the main item cache. This functions as a loose L1/L2 cache structure. To borrow terminology from multimaster replication systems, memcached stores “tombstone” records.

Benchmarks, hot off the Pressflow

Independent benchmarks by Josh Koenig from Chapter Three show a 3000x throughput increase and a 40x load reduction moving from plain Drupal to Pressflow + Varnish. His testing was performed on a small Amazon EC2 instance, demonstrating how Pressflow can deliver internet-scale performance on modest, inexpensive hardware. Pressflow is able to deliver this class of performance because it’s optimized to support Varnish and other enterprise-grade web infrastructure tools in ways that standard Drupal cannot.

With Pressflow’s API compatibility with Drupal, Josh’s move from Drupal to Pressflow on his project didn’t require any coding or extensive testing. He just replaced Drupal core with Pressflow. (It’s no harder than a minor Drupal update.)

For single-server setups in the Amazon EC2 cloud, Josh’s Project Mercury AMI provides a click-and-run, configured setup with the Pressflow + Varnish stack. For more complex setups, Four Kitchens provides infrastructure consulting services on the Pressflow system.

Need to scale Drupal on EC2? Check out Chapter Three's Mercury project

Josh Koenig from Chapter Three has made pre-release EC2 AMIs (pre-packaged virtual machine images) for Mercury, a project to combine Four Kitchens’ Drupal-derived, high-performance Pressflow with Varnish, Cache Router, and memcached. Initial results show it easily saturating the EC2’s pipe. Mercury instances directly update their Pressflow releases from the Four Kitchens Bazaar server.

Mercury is an exciting project for anyone who needs to run a high-traffic, Drupal-based site without having to configure a bunch of caching systems.

Pressflow 6 now offers direct downloads

For quite some time, Four Kitchens has provided Pressflow releases to its large-scale clients and anyone interested enough to request a copy. We provided limited access to copies so that we could understand what organizations expected of Pressflow, how they wanted to use it, and so that we could keep all users updated with the latest security, bug-fix, and feature releases.

Improvements to the Materialized View API

An eye-catching graphic, largely irrelevant to this blog post.An eye-catching graphic, largely irrelevant to this blog post.

The Materialized View API (related posts) provides resources for pre-aggregation and indexing of data for use in complex queries. It does this by managing denormalized tables based on data living elsewhere in the database (and possibly elsewhere). As such, materialized views (MVs) must be populated and updated using large amounts of data. As users change data on the site, MVs must be intelligently updated to avoid complete (read: very slow) rebuilds. Part of performing these intelligent updates is calculating how user changes to data affect MVs in use. Until now, these updates had limitations in scalability and capability.

David's Epic Presentation Megapost

The hidden costs of proprietary software: #1 optimizing around licensing

Articles abound about the “hidden costs” of using free, open-source software. Many of them are sponsored by companies with a stake in their own proprietary solutions — and they’re responding to the threat of increasing enthusiasm about free alternatives. Some of the claims are legitimate; others are FUD.

Here at Four Kitchens, we’re on the opposite side. We advocate using free software like Drupal (and our own free-software derivative, Pressflow) whenever possible. When it’s not immediately possible, it’s a hard decision between writing a free solution and going proprietary. We enjoy the freedom of free software for many reasons, especially because it doesn’t feel like we’re fighting the company behind the software in order to get the most out of it.

What makes Pressflow scale: #1 faster core queries

Drupal has a number of queries with unfortunate scalability profiles.

URL alias counting (one instance in core)

The biggest offender in Drupal 5 and Drupal 6 is the query counting the number of URL aliases: SELECT COUNT(<em>) FROM url_alias. This query dates back to when nearly every Drupal site ran on MyISAM, which is important because MyISAM keeps an exact count of the number of rows in every table, making SELECT COUNT (</em>) FROM [table] an O(1) (read: fast, constant-time) operation.