

Coding for source control

Hot on the heels of my Coding for Coders entry (focused on C), here's another on coding for source control.

When you have a large code base in a source control system (like subversion), you'll find that things go more smoothly if you adopt a few coding practices that work hand-in-hand with the way that the version control system works.

Embrace branches and tags

You really should investigate how to use the branching and tagging feature in your source control system. A typical practice is to do development in trunk and have a branch for each major version of the code (eg: a 1.0 branch, 2.0 branch and so on), tagging that branch each time you reach a significant point in development and each time you ship code. Depending on your project, you might branch for minor versions too (eg: 1.2, 1.3).
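In subversion, for example, branches and tags are just cheap copies. As a rough sketch, assuming the conventional trunk/branches/tags repository layout (the URLs below are purely illustrative), the workflow looks something like this:

   # create a branch for the 1.0 release series
   svn copy http://svn.example.com/repo/trunk \
            http://svn.example.com/repo/branches/1.0 \
            -m "branch for 1.0"

   # tag the branch each time you ship
   svn copy http://svn.example.com/repo/branches/1.0 \
            http://svn.example.com/repo/tags/1.0.2 \
            -m "tag 1.0.2 release"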

Think in terms of changesets

If you're working on a bug fix or implementing a feature, it's good practice to distill the net development effort down to a single patch for the tree. The set of changes in that patch is the changeset that implements the bug fix or feature.

Once you have the changeset, you can look at applying it to one of your branches so that you can ship the fixed/enhanced product.

Trivial fixes can usually be implemented with a single commit to the repository, but more complex changesets might span a number of commits. It's important to track the commits so that your changeset is easier to produce.

We use trac for our development ticket tracking. It's easy to configure trac/subversion to add a commit hook that allows developers to reference a ticket in their commit messages and then have all the commits related to that ticket show up as comments when viewing the ticket. You can then merge each commit into your working copy and then check in the resulting changeset.
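As a sketch of that last step with subversion (the revision numbers and ticket number here are made up; svn merge -c needs subversion 1.4 or later, use -r N-1:N with older clients), you can pull each commit for a ticket into a clean working copy of the target branch and commit the combined result as one changeset:

   svn checkout http://svn.example.com/repo/branches/2.0 wc-2.0
   cd wc-2.0
   svn merge -c 1234 http://svn.example.com/repo/trunk
   svn merge -c 1241 http://svn.example.com/repo/trunk
   svn diff          # review the combined changeset before committing
   svn commit -m "Fix ticket #57: merge r1234, r1241 from trunk"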

If one or more of your developers are making extensive changes, it's a good idea for them to do their work in their own branches. That way they won't step on each other's toes during development. You might also want to look at creating a branch per big ticket--this will allow you to exploit the diffing/merging features of your source control system to keep track of the overall changeset.

Code with merging in mind

When you're making code changes, try to think ahead to how the patch will look, and how easy it will be for your source control system to manage merging that code.

A few suggestions:

  • if you have a list of things to update, break the list up so that each item has its own line.
  • if the list has a separator character (eg: a comma), include the separator on the last line of the list.
  • if you're adding to a list, add to the end if possible.
  • avoid changing whitespace; try to have your patch reflect functional changes only.

Your goal is to minimize the patch so that it represents the smallest possible set of changed lines. If you can avoid touching peripheral lines around your changeset, you reduce the risk of running into conflicts when you merge.

Get into the habit of diffing your changes against the repository while you work, and certainly always diff before you commit. If you find changed lines that are not essential for the patch (whitespace changes in particular), take them out!

Here's an example from a makefile:

      SOURCES = one.c two.c three.c

This is nice and readable at first, but over time this line may grow to include a large number of source files. People will tend to add to the end at first, and perhaps alphabetically once the number of files increases. The resulting diff shows a single modified line but won't really show you what changed on that line. Things get difficult when two changesets affect that line; you'll get a conflict because the source control system doesn't know how to merge them.

      # this is better
      SOURCES = \
        one.c \
        two.c \
        three.c \

Each item now has its own line. By getting into the habit of adding at the end, complete with separator or continuation character, you help the merge process: each item you add will be a single-line diff, and the source control system will know that you're adding it at the end, improving the chances of a successful merge a great deal.

Adding at the end isn't the golden rule so much as making sure that everyone adds consistently. Often, order is important, so adding at the end isn't going to help you. By adding in a consistent manner, you reduce the chances of touching the same lines as another changeset and thus reduce the chances of a conflict.

Here's the same example, but in PHP:

      $foo = array("one", "two", "three");

better:

      $foo = array(
              "one",
              "two",
              "three",
             );

Dangling commas are good! :)

Keep the diff readable

Don't take the concept of small diffs too literally--if you can express your change on a single line that is 1024 characters long you've made the merge easier at the expense of making it really hard to review what the change does. This basically boils down to making sure that you stick to the coding standards that have been established for the project.

Don't sacrifice human readability for the sake of easier merging.

If you find that you need to merge a changeset to more than one branch (say you have a bug fix to apply to 2.0 and 2.0.1) then it's often easier to merge to 2.0 first, resolve any conflicts, commit and merge the 2.0 changeset into 2.0.1 rather than the trunk changeset direct to 2.0.1.
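In subversion terms (revision numbers again made up for illustration), that ordering looks something like this:

   # merge the trunk changeset to the 2.0 branch first
   cd wc-2.0
   svn merge -c 1234 http://svn.example.com/repo/trunk
   svn commit -m "merge r1234 from trunk"        # say this becomes r1250

   # then merge the 2.0 version of the change into 2.0.1
   cd ../wc-2.0.1
   svn merge -c 1250 http://svn.example.com/repo/branches/2.0
   svn commit -m "merge r1250 from 2.0"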

These practices aren't obtrusive and will help you when you need to merge a changeset from one branch to another.

I don't pretend to know everything; these are just a couple of tidbits I thought I'd share. If you have other similar advice, I'd like to hear it--feel free to post a comment.

Coding for coders: API and ABI considerations in an evolving code base

As you may know, we have an MTA product that is designed to be extended by people writing modules in C/C++, Java and Perl. To facilitate this, not only do we need to write the code for the product, but we also need to provide an API (Application Programming Interface) to our customers and partners so that they can build and run their modules.

There are a number of considerations when publishing an API:

Make the API easy to use

If the API is hard to understand then people will use it incorrectly, which might result in things blowing up in rare conditions that didn't come up in their testing. APIs tend to be hard to use if they have too many parameters or do too many things. It's a good idea to keep your API functions small and concise so that it's clear how they are supposed to work.

If you have a complex procedure with a number of steps, you should encapsulate those steps in a single API function. This makes it easier to perform that procedure correctly in the future.
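For example (the msys_* names here are hypothetical, not our actual API), rather than documenting a connect/authenticate dance and hoping every module performs the steps correctly and in order, you can publish one function that does the whole procedure:

   /* hypothetical lower-level calls, assumed to exist elsewhere in the API */
   typedef struct msys_conn msys_conn;
   int msys_conn_create(msys_conn **connp);
   int msys_conn_open(msys_conn *conn, const char *host);
   int msys_conn_auth(msys_conn *conn, const char *user, const char *pass);
   void msys_conn_destroy(msys_conn *conn);

   /* one call that wraps the whole procedure, so callers can't
    * perform the steps incorrectly or out of order */
   int msys_connect_and_login(const char *host, const char *user,
                              const char *pass, msys_conn **connp)
   {
      msys_conn *conn = NULL;
      int err = msys_conn_create(&conn);

      if (err == 0) err = msys_conn_open(conn, host);
      if (err == 0) err = msys_conn_auth(conn, user, pass);

      if (err != 0) {
         if (conn) msys_conn_destroy(conn);
         return err;
      }
      *connp = conn;
      return 0;
   }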

Good documentation is a key component to ensuring that the APIs are used correctly; not only does it tell people how to use the API, it tells you how people are supposed to be using the API. More on that in a bit.

Don't change those APIs!

Once you've created an API and shipped your product and its gloriously detailed documentation, people will start to use it. There are two broad categories of people that will consume your API: customers that are building their own modules, and partners that build modules to sell to other people running the software. Any changes that you make to the API will require the first group to update their code, recompile and re-deploy. The latter group will need to do the same, and also ship the updated modules to their customers.

This is a pain for both groups of people. If the API changes you make are extensive, someone on their side has to become familiar with those changes and figure out how to migrate their code from the old API to the new API in such a way that things still work. They may not have the resources to do this at the point where you release those changes, so you really need to avoid changing the API if you're shipping a critical bug fix.

ABI changes are bad too

ABI is an acronym for Application Binary Interface. It's similar to API, but the distinction is that the API affects how you program against something, whereas the ABI affects how the machine code expects things to work. If you're coming from a dynamic/scripting background, ABI doesn't really apply. Where it really matters is in cases where you're compiling your code and shipping the result. When you compile your code, the compiler figures out things like offsets of fields in structures, the order of parameters, the sizes of structures and so forth, and encodes these things into the executable.

This is best illustrated with an example:

   struct foo {
      int a;
      int b;
   };
   int do_something(int param1, struct foo *foo);
   #define DOIT(a, b)   do_something(a, b)

Now, imagine that we ship another release where we've tweaked some code around:

   struct foo {
      int b;
      int a;
   };
   int do_something(struct foo *foo, int param1);
   #define DOIT(a, b)   do_something(b, a)

From an API perspective, things look the same (assuming that people only use the DOIT macro and not the do_something() function directly). If you don't rebuild the code, weird things will happen. For instance, the a and b fields in the foo structure have swapped places. That means that code compiled against the release 1 headers will be storing what it thinks is the value for a in the b slot. This can result in anything from subtle to not-so-subtle misbehavior when the code is run, depending on what those functions do. The switch in the ordering of parameters to the do_something() function leads to similar problems.

These problems will vanish if the third party code is rebuilt against the new headers, but this requires that the updated code be re-deployed, and that may require additional resources, time and effort.

ABI changes are bad because they are not always immediately detected; the code will load and run until it either subtly corrupts memory or less subtly crashes because a pointer isn't where it used to be. The code paths that lead to these events may take some time to trigger.

In my contrived example above there was no reason to change the ordering of those things, and not changing them would have eliminated those problems.

Avoiding ABI and API breakage

A common technique for enhancing API calls is to do something like this:

   int do_this(int a);

and later:

   int do_this_ex(int a, int b);
   #define do_this(a)   do_this_ex(a, 0)

This neatly avoids an API change but breaks ABI: the do_this() function doesn't exist any more, so the program will break when that symbol is referenced. Depending on the platform, this might be at compile time or it might be at run time at the point where the function is about to be called for the first time.

If ABI is a concern for you, something like this is better:

   int do_this(int a) {
      return do_this_ex(a, 0);
   }

This creates a "physical" wrapper around the new API. You can keep the #define do_this() in your header file if you wish, which saves an extra function call frame for people who compile against the new API; people relying on the old ABI will still find that their linker is satisfied and that their code continues to run.

Oh, and while I'm talking about making extended APIs, think ahead. If you think you're going to need an extra parameter in there one day, you can consider reserving it by doing something like this:

    int do_this(int a, int reserved);

and then documenting that reserved should always be 0. While that works, try to think a bit further ahead. Why might you need to extend that API? Will those projected changes require that additional APIs be added? If the answer is yes, then you shouldn't reserve parameters because what you'll end up with is code that does stuff like this:

   // I decided that I might add 4 parameters one day
   do_this(a, 0, 0, 0, 0);
   // but when that day arrived, I actually added a new function
   // that only needed 3
   do_this2(a, b, c);

Those reserved parameters add to your code complexity by making it harder to immediately grasp what's going on. What do those four zeros mean? Remember that one of the goals is to keep things simple.

You might have noticed that I called the new version of the API do_this2() instead of do_this_ex(). This also stems from thinking ahead. do_this_ex() is (by common convention) an extended form of do_this(), but what if I want to extend the extended version--do I call it do_this_ex_ex()? That sounds silly.

It's better to acknowledge API versioning as soon as you know that you need to do it. I'm currently leaning towards a numeric suffix like do_this2() for the second generation of the API and do_this3() for the third and so on.

Each time you do this, it's usually a good idea to implement the older versions of the APIs in terms of calls to the newer versions. This avoids code duplication which has a maintenance cost to you.
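As a sketch of what that looks like after a couple of generations (using the do_this2()/do_this3() naming from above), only the newest function contains a real implementation:

   /* newest generation carries the real implementation */
   int do_this3(int a, int b, int c) {
      /* ... the actual work happens here ... */
      return a + b + c; /* placeholder */
   }

   /* older generations become thin wrappers; there is only one
    * implementation to maintain, test and fix */
   int do_this2(int a, int b) {
      return do_this3(a, b, 0);
   }

   int do_this(int a) {
      return do_this2(a, 0);
   }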

Of course, you'll make sure that you have unit tests that cover each of these APIs so that you can verify that they continue to work exactly as expected after you make your changes. At the very least, the unit tests should cover all the use cases in that wonderful documentation that you wrote--that way you know for sure that things will continue to work after you've made changes.
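A minimal sketch of that kind of test, using plain assert() (substitute whichever test harness you actually use); the point is that the documented behaviour of the old entry points is pinned down before and after the change:

   #include <assert.h>

   int main(void) {
      /* the old API must keep behaving exactly as documented */
      assert(do_this(2) == do_this2(2, 0));
      /* the new API gets its own coverage too */
      assert(do_this2(2, 5) == do_this3(2, 5, 0));
      return 0;
   }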

Structures and ABI

I got a little sidetracked by talking about API function versioning. What about structures? I've already mentioned that changing the order of fields is "OK" from an API perspective but not from an ABI perspective. What about adding fields?

   struct foo {
      int a;
      int b;
   };

becoming:

   struct foo {
      int a;
      int b;
      int c;
   };

Whether this breaks ABI depends on how you intend people to use that structure. The following use case illustrates an ABI break:

   int main() {
      struct foo foo;
      int bar;
      do_something(&foo);
   }

Here, foo is declared on the stack, occupying 8 bytes in version 1 and 12 bytes (maybe more with padding, depending on your compiler flags) in version 2. On either side of foo on the stack are the stack frame and the bar variable. If a program built against the version 1 headers runs against the version 2 libraries, the do_something() function will misbehave when it attempts to access the c field of the structure. If the usage is read-only, it will be reading "random" garbage from the stack--either something in the stack frame or perhaps even the contents of the bar variable, depending on the architecture and compilation flags. If it tries to update the c field, then it will be poking into either the stack frame or the bar variable--stack corruption.

You can avoid this issue by using pointers rather than on-stack or global variables. There are two main techniques; the first builds ABI awareness into your APIs:

   struct foo {
      int size_of_foo;
      int a;
      int b;
   };
   int main() {
      struct foo foo;
      int bar;
      foo.size_of_foo = sizeof(foo);
      do_something(&foo);
   }

The convention here is to ensure that the first member of a structure is populated with its size. That way you can explicitly version your structures in your header files:

   struct foo_1 {
      int size_of_foo;
      int a;
      int b;
   };
   struct foo {
      int size_of_foo;
      int a;
      int b;
      int c;
   };
   int do_something(struct foo *foo) {
      if (foo->size_of_foo >= sizeof(struct foo)) {
         // we know that foo->c is safe to touch
      } else if (foo->size_of_foo == sizeof(struct foo_1)) {
         // "old style" foo, do something a bit different
      }
   }

Microsoft are rather fond of this technique. Another technique, which can be used in conjunction with the ABI-aware-API, is to encapsulate memory management. Rather than declare the structures on the stack, the API consumer works with pointers:

   int main() {
      struct foo *foo;
      int bar;
      foo = create_foo();
      foo->a = 1;
      foo->b = 2;
      do_something(foo);
      free_foo(foo);
   }

This approach ensures that all the instances of struct foo in the program are of the correct size in memory, so you won't run the risk of stack corruption. You'll need to ensure that create_foo() initializes the foo instance in such a way that the other API calls that consume it will treat it as a version 1 foo instance. Whether you do this by zeroing out the structure or building in ABI awareness is up to you.
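Here's a sketch of what create_foo() and free_foo() might look like under the size_of_foo convention described above:

   #include <stdlib.h>

   struct foo *create_foo(void) {
      /* calloc() zeroes the structure, so any fields added in later
       * versions start out with a harmless default value */
      struct foo *foo = calloc(1, sizeof(*foo));

      if (foo != NULL) {
         foo->size_of_foo = sizeof(*foo);
      }
      return foo;
   }

   void free_foo(struct foo *foo) {
      free(foo);
   }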

Encapsulation

You can protect your API consumers from ABI breakage by providing a well encapsulated API. You do this by hiding the implementation of the structure and providing only accessor functions.

   struct foo; /* opaque, defined in a header file that you
                * don't ship to the customer */
   struct foo *create_foo();
   void free_foo(struct foo*);
   void foo_set_a(struct foo *, int value);
   int  foo_get_a(struct foo *);

By completely hiding the layout of the foo structure, the consumer's code is completely immune to changes in the layout of that structure, because it is forced to use the accessor APIs that you provide.
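The implementation side might look something like the sketch below (the foo.h and foo.c file names are just illustrative, and create_foo()/free_foo() would look much like the earlier sketch); the structure layout lives only in your private source file, so it can change freely between releases:

   /* foo.c -- private implementation, not shipped to API consumers */
   #include "foo.h"   /* the public, opaque declarations shown above */

   struct foo {
      int a;
      int b;   /* fields can be added or reordered here;
                * consumers never see this layout */
   };

   void foo_set_a(struct foo *foo, int value) {
      foo->a = value;
   }

   int foo_get_a(struct foo *foo) {
      return foo->a;
   }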

You can see a practical example of this in Solaris's ucred_get(3C) API.

Encapsulation has a trade-off though; if there are a lot of fields that you need to set in a structure, you might find that the aggregate cost of making function calls to get and set those values becomes significant. My usual disclaimer applies though--don't code it one way because you think it will run faster--do it after you've profiled the code and when you know that it will be faster. It's better to opt for maintainability first, otherwise you might as well be hand-coding in assembly language.

Summing up

It can be hard to retrofit API and ABI compatibility; it's best to plan for it early on, even if you just decide that you're not going to do it.

Projects typically adopt a strategy along the lines of: no ABI (and thus API) breaks in patchlevel releases. Avoid ABI (and thus API) breaks in minor releases. The API will only break in major releases, after appropriate deprecation notices are published and a suitable grace period is observed to facilitate migration.

Folks that are truly committed to API/ABI preservation will have a long deprecation period and will add an extra restriction--API/ABI changes will be removals only.

API/ABI preservation is a challenge, but if you get it right, your API consumers will love you for it.

I'll leave you with some bullet points:

  • Avoid changing APIs.
  • Avoid changing ABIs.
  • It's particularly important to preserve ABI compatibility if you're shipping a patch level release, because people tend to put less effort into QA and might overlook a breakage.
  • If you need to expand, spawn a new generation of APIs rather than mutating existing ones.
  • If you need to expand structures, don't change the ordering of fields; add new fields to the end.
  • Encapsulate structures with APIs if you can.
  • Unit tests are essential.
  • Documentation is very important.

parser and lexer generators for PHP

[Update: I've put these parser/lexer tools on BitBucket and Github; enjoy!]

From time to time, I find that I need to put a parser together. Most of the time I find that I need to do this in C for performance, but other times I just want something convenient, like PHP, and have been out of luck.

This Thanksgiving I set out to remedy this: I adapted lemon to optionally emit PHP code, and did likewise with JLex.

You need a C compiler to build lemon, and a Java compiler and runtime to build and run JLexPHP, but once you have translated your .y and .lex files with these tools, you're left with a pure PHP parser and lexer implementation.

The parser and lexer generators are available under a BSDish license, from both BitBucket and Github.

See enclosed README files for more information.

Help build a public UK postcode database

Via BoingBoing:

New Public Edition maps are trying to create a freely usable UK postcode database. The British Post Office owns the database of postcodes and their corresponding coordinates. That means that your website can only use postcodes if you buy a license from the Post Office.

New Public Edition (along with a similar project, Free the Postcode) is trying to solve this. They have 1950s-era public-domain maps and they ask you to locate your house (or childhood home) on them and key in your postcode. They do the rest, eventually building out a complete database of every postcode in Britain.

The resulting data will be released as purely public domain--no restrictions whatsoever on re-use.

I just filled in a couple of postcodes from previous residences, and it was quite interesting to see how the area that I grew up in has changed since 1950; it looks like it used to be one large farm that was broken up into a couple of smaller farms that have now become residential areas. It's a logical progression really, but having a date like 1950 gives a sense of dimension--it's easy to think that that change happened "hundreds of years ago", but it's much more recent than that.

So, if you're in the UK, or lived there for a while, please take a couple of minutes to visit New Public Edition, fill in your postcode, and perhaps gain a better understanding of the places you've lived.

HTTP POST from PHP, without cURL

Update May 2010: This is one of my most popular blog entries, so it seems worthwhile to modernize it a little. I've added an example of a generic REST helper that I've been using in a couple of places below the original do_post_request function in this entry. Enjoy!

I don't think we do a very good job of evangelizing some of the nice things that the PHP streams layer does in the PHP manual, or even in general. At least, every time I search for the code snippet that allows you to do an HTTP POST request, I don't find it in the manual and resort to reading the source. (You can find it if you search for "HTTP wrapper" in the online documentation, but that's not really what you think you're searching for when you're looking).

So, here's an example of how to send a POST request with straight up PHP, no cURL:
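A minimal version of such a helper, built on the http stream wrapper and stream_context_create(), looks something like this (a sketch with deliberately brief error handling):

<?php
function do_post_request($url, $data, $optional_headers = null)
{
   // build an http context that turns fopen() into a POST request
   $params = array('http' => array(
      'method'  => 'POST',
      'content' => $data,
   ));
   if ($optional_headers !== null) {
      $params['http']['header'] = $optional_headers;
   }
   $ctx = stream_context_create($params);
   $fp = fopen($url, 'rb', false, $ctx);
   if (!$fp) {
      throw new Exception("Problem with $url");
   }
   $response = stream_get_contents($fp);
   if ($response === false) {
      throw new Exception("Problem reading data from $url");
   }
   return $response;
}

// example usage:
// $response = do_post_request('http://example.com/script.php',
//    http_build_query(array('name' => 'value')),
//    "Content-type: application/x-www-form-urlencoded\r\n");
?>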

I'm looking for another Dark Apprentice

I'm looking for someone who wants to hone their existing 3+ years of C hacking and debugging skills on some of the fastest, most highly stressed core infrastructure applications ever created.

The full job description is available on the OmniTI Careers page.

A successful applicant for the position will join the ranks of my Dark Apprentices and will have the opportunity to learn and develop skills such as:

  • Performant, scalable thinking. Writing and troubleshooting code that runs in high stress environments.
  • Sith debugging. Mastering the inner mysteries to deduce ways to effectively reproduce and resolve otherwise impossible problems.
  • All the fun and happy details of the various email specs.
  • Dry wit. You'll have the option of picking up some of my British humour.

There's plenty of scope for developing these skills and more.

If you're interested in this position, or know someone else that might be, please direct resumes to jobs[at]messagesystems.com.

(I hope the folks on Planet MySQL and Planet PHP don't mind the cross-posting; we do do work with both PHP and MySQL, so it's not totally off topic. Thanks for reading!)

On the road to San Jose for ZendCon'06

I'm currently sitting in Atlanta airport (because it's on the way to San Jose from BWI, obviously).

I really enjoyed last year's conference, so I have great expectations this year. I'll be giving the short version of my PDO talk again this year (but this time, in shiny Keynote on my shiny MacBook).

I think I'll try to attend the session "Managing PHP and PHP Applications on Windows" to see what the folks at Microsoft have to say about that, and "Unlocking The Enterprise Using PHP and Messaging and Queuing" to see what IBM have planned there. Outside of the sessions, I'm going to sit down with Andrei and Sara to discuss implementing Unicode for PDO in PHP 6.

Ah, time to board. See you there if you're there!

Background/batch/workflow processing with PDO::PGSQL

One of the other things I've been looking at is ways to implement background processing in PHP. In my recent talk on sending mail from PHP, I mention that you want to avoid sending mail directly from a web page. A couple of people have asked me how to implement that, and one of the suggestions I have is to queue your mail in a database table and have some other process act on that table.

The idea is that you have a PHP CLI script that, in an infinite loop, sleeps for a short time then polls the database to see if it needs to do some work. While that will work just fine, wouldn't it be great if the database woke you up only when you needed to do some work?
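The polling version of that CLI script is only a few lines; here's a rough sketch (the work table and its done column are hypothetical, and dispatch_work() is the same helper used in the notify-based version below):

<?php
   $db = new PDO('pgsql:');
   while (true) {
      $q = $db->query("SELECT count(*) FROM work WHERE done = false");
      if ($q->fetchColumn() > 0) {
         dispatch_work();
      }
      sleep(15); // shorter sleeps mean lower latency but more queries
   }
?>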

I've been working on a patch originally contributed by David Begley that adds support for LISTEN/NOTIFY processing to the Postgres PDO driver. With the patch you can write a CLI script that looks a bit like this:

<?php
   $db = new PDO('pgsql:');
   $db->exec('LISTEN work');
   dispatch_work();
   while (true) {
      if (is_array($db->pgsqlGetNotify(PDO::FETCH_NUM, 360))) {
          dispatch_work();
      }
   }
?>

This script will effectively sleep for 360 seconds, or until someone else issues a 'NOTIFY work' query against the database, like this:

<?php
   $db->beginTransaction();
   $q = $db->prepare('insert into work(...) values (...)');
   $q->execute($params);
   $db->exec('NOTIFY work');
   $db->commit();
?>

When the transaction commits, the CLI script will wake up; pgsqlGetNotify() returns an array containing 'work' and the process id of the notifying backend, and the script then calls dispatch_work(), which is some function that queries the database to find out exactly what it needs to do, and then does it.

This technique allows you to save CPU resources on the database server by avoiding repeated polling queries against the server. The classic polling trade-off is to increase the time interval between polls at the cost of increased latency. The LISTEN/NOTIFY approach is vastly superior; you do zero work until the database wakes you up to do it--and it wakes you up almost immediately after the NOTIFY statement is committed. The transactional tie-in is nice too; if something causes your insert to be rolled back, your NOTIFY will roll back too.

Once PHP 5.2.0 is out the door (it's too late to sneak it into the release candidate), you can expect to see a PECL release of PDO::PGSQL with this feature.