Blog

Blowing out some cobwebs

It's been a long time since I last wrote anything new here; so long, in fact, that BitBucket killed their Mercurial product which was hosting a number of my Open Source projects, Google killed their analytics product (and migrated it to something else that I don't care to adopt), and my hair has turned mostly silver!

It's long overdue for a bit of a dusting out of cobwebs, and I fancied a bit of a change from my normal weekend activities, so I sat down to do that this weekend.

For some time now, dependabot has been nagging me about some ruby ecosystem security issue or other as a result of this site being based on Jekyll, and I desired to never see such a thing again. So, as part of dusting things off, I migrated over to Material for MkDocs.

What's changed? Not too much really; I've updated some of the information about me and my projects and added a more recent photo. The fanciest part of this update is that there is now a Dark Mode!

Does this mean you're going to start blogging more regularly than once a decade!?

I'm not sure. If there's something you'd like me to talk about here, @-me on fosstodon and let's see.

a new domain

I'm letting thebrainroom.com lapse this year after holding it for 11 years. For those that weren't with me back then, The Brain Room was my consulting company, providing graphic and software design services.

When I joined OmniTI, I chose netevil.org to use for my blog in a kind of self-mocking move; I equated my efforts at taking over the world with superior software with the effectiveness of Dr. Evil. A lot of people thought I was styling myself after Dr. Evil, which wasn't quite right, but close enough :-)

Since I've been going around updating my account information with various services, I thought a change would be in order. I'm of the opinion that one should use their own name as part of their internet persona, so I finally got a domain that does just that.

So here we are: wezfurlong.org. I've taken the opportunity to give things a really minor facelift, and also migrated comments to the Disqus service.

The move may cause a re-post of some of my articles to aggregating services; I apologize for that. netevil.org is not going anywhere anytime soon, but if you have me on file using thebrainroom.com, you will need to update your information to wezfurlong.org instead.

I'll be at ZendCon 2008

Busy times here mean that I'm leaving it a bit late to say this, but I'll be at ZendCon this year too. I'll be giving the usual talk on PDO, but my main reason for attending this year is to sync up with other PHP folks and talk about where PHP is at and where it's going.

ZendCon has been consistently good, and I look forward to attending again this year... see you there?

Virtualization on OSX

I'm about to go on the road again and I've been getting my laptop updated to make sure I can function without internet access. For me that means that I need a Linux environment. I've been using Parallels for this because it was the only option when I first got my MacBook, and I'm not terribly impressed with its ability to run Linux virtual machines.

First I have to say that my preferred usage for VMs is to disable as much graphical UI as possible and log in using the terminal; I want to avoid any excess resource usage because I'm on a laptop and I want better battery life.

Here's my gripe list:

  • poweroff spins the CPUs up to 100% or more utilization and doesn't actually power the machine off.
    The reason? ACPI is only supported for Vista guests. I'm rather bemused by this statement, because the whole point of ACPI is to virtualize certain types of hardware access--it should not be targeted at a particular OS.
  • Parallels Tools requires X to run.
    You can manually run the daemon, but it spins the CPU trying to open the display. This means that you can't get time synchronization with the host unless you want to load your CPU.
  • Shared folder performance sucks.
    Mounting the host filesystem over NFS is faster, but it kernel panics OSX (the latter is probably an OSX bug).

Outside of these issues, it's not bad though. I'm rather disappointed about the level of Linux support from Parallels--I had all the same problems a year ago and nothing seems to have changed. It's clear that their priority is making the Windows VM experience nice and integrated, and that's their prerogative.

I've also tried VMware Fusion, both the current stable and the beta versions; here's my gripe list:

  • LVM and VMware Fusion appear not to like each other.
    Sometimes on reboot the VM filesystem is corrupt, especially the rpm database, and the image needs to be tossed and reinstalled without LVM. This is problematic because the default install for CentOS is LVM based.
  • VMware Fusion freezes, can't be killed and can't be restarted.
    The resolution is a reboot of the laptop, which isn't reliable--power-cycling is required.

I like VMware (I've been using it for many years), but it's not a happy camper on my laptop; I've uninstalled it.

I've also tried VirtualBox, and it's pretty good, but not perfect; gripes:

  • Only supports NAT networking, with no locally visible IP/network.
    This means that I can't ssh into individual machines by IP and have to set up port forwarding to get into them from my terminal.
  • Setting up port forwarding requires 3 long and tedious command invocations for each port.

Some positives for VirtualBox:


  • ACPI support appears to be very complete.
    The GUI even allows you to distinguish between an ACPI power off request and yanking out the power cord.
  • SATA controller emulation.
    This is faster than IDE/ATA emulation, which is all that Parallels offers. VMware offers SCSI as an option, but that's a non-starter for me currently.
  • You can run VirtualBox VMs completely headless and optionally export the console display using Remote Desktop.
  • It's free to download and run.

I'm sticking with Parallels for the time being; I think that VirtualBox might become my favourite once they've beefed up the networking support on OSX.

I'll leave you with a couple of performance tips that should apply to any virtualization software:

  • Use fixed-size virtual disks in preference to dynamically expanding disks. This will improve filesystem performance.
  • Linux kernels by default have a high timer frequency that can torture the emulation and cause it to have higher CPU load.
    If it makes sense for your vm, you can rebuild the kernel to use a lower frequency.
    If you're using centos, grab one of these pre-compiled kernels and reboot.
    This resulted in a drop from 30% CPU utilization when idle to 7% for me in Parallels, and a less significant drop in VirtualBox.

PHP Recap/Redux

I've been pretty damned busy of late (we're in the late stages with going gold for our next Message Systems product release), but have managed to be involved in a couple of things PHPish, although I haven't had much time to follow up and talk about them.

MIX

I was invited to be a panelist at Microsoft's MIX conference for a discussion on the traditional pain points of getting PHP to run in a Windows environment and interoperate with ASP apps, and on how Microsoft have taken a number of steps to make the experience nicer: improving the developer experience with IIS, shipping FastCGI support, and working with PHP core developers to identify and tune some hotspots in PHP. The panel was pretty well attended given that it was one of the last sessions of the conference. You can find a recording of this session online here.

At MIX, the hot news was mostly Silverlight. It demos very nicely and really does seem like a Flash killer, particularly because the tools are very nicely done. The really nice thing about Silverlight from my perspective is not so much the eye candy (sweet as it may be) as the Dynamic Language Runtime (DLR). The DLR allows you to run a subset of "dot-net" on the client side (both Windows and Mac), including scripting languages like Ruby and Python. This allows for some interesting possibilities, from something as basic as being able to use the same languages on both the client and the server side (very compelling from a maintenance perspective), to being able to use multiple languages (and libraries written in those languages) and call between them in your client side app.

This stuff isn't really all that new (you've been able to do some of that with COM compatible scripting interfaces for years--there's also a PHPScript implementation for the brave), but what's exciting is that it is bundled up into a runtime that has eye candy and support for two common OS platforms. The trick is in the eye-candy; that feature will wow people and cause a more rapid adoption of Silverlight than if it was just the DLR on its own.

Speaking of the DLR, Andi Gutmans and I made it to the excellent Just Glue IT! talk presented by Jim Hugunin and John Lam (I love that URL!), on Python and Ruby (and more) in the DLR on Silverlight. It was very informative as well as humorous, with some nice live demos. You might be wondering if we're interested in PHP running on the DLR. I would love to see it there, even if it was just a subset of the PHP that we know and love. Perhaps the Phalanger project might shift in that direction?

From an organizational point of view, MIX, the conference, was very well put together. Some nice touches included a speaker room equipped with snacks (ranging from power/protein bars and fruit to chips and candy), soda (typically very difficult to find at a conference without walking out of the conference area and paying exorbitant prices--important for me, as coffee is a migraine trigger), and, what really clinched it for me, Red Bull (including sugar free).

Another nice touch was a double-sided laminated name tag--those things have a habit of flipping around so that you can't read them and find out who you're talking to. There was also a "sandbox" for you to bail out from the conference and sit down and play with the new technologies (they provided a number of machines for that purpose) or just sit down and talk. Minus points for not having enough (any?) power strips in the sessions themselves though; it made it difficult to get some work done while absorbing a session.

php|tek

It felt like php|tek was the first true PHP conference I've done this year (and that might even be true--I didn't bother to look back and check), so I was looking forward to being there, and also to seeing a bit more of Chicago, although I was a little disappointed to find that the conference was set in the "airport town", just far enough away from the real city to make visiting it a hard prospect. Such is life.

I think the php|architect folks did a fine job considering that the hotel threw a few spanners (or wrenches for you American folks) into the works, pushing a number of people (myself included) out of the conference hotel proper and into its more plain cousin a block or two down the street. I particularly wanted to attend Jeff Moore's talk on maintainable code but there was no room--people literally fell out of the session when I opened the door to get in there.

It was good to catch up with people again (and slightly weird to meet people that I'd seen a couple of weeks earlier at MIX--it's a bit surreal to be jumping timezones and locations and still see the same people), and to meet some more PHPWomen face-to-face. We had fun in the PHP trivia competition, and some of us were roped in to doing a podcast which came out surprisingly coherent despite the amount of alcohol in the room (I suspect that's because it was largely consumed by one of the Facebook guys ;-)

As someone who's been doing these conferences for a few years now, it's interesting to see the increasing number of MacBook laptops in use. I didn't count everybody's laptop, but the areas I frequented during the conference appeared to have MacBooks in the majority.

One of these conferences, I'll make it to one of Joe Stagner's talks and be there for the whole thing--I've tried to make that happen for at least the last 4 that I've been to, but it hasn't worked out the way I planned it, so far.

Blog upgraded

I've had the code sitting around for months, but haven't had the time to push it to production until now.

This is the third generation of my blog and incorporates the data from its prior incarnations (including my original s9y based 'zlog). The new architecture uses PostgreSQL for the database, largely because I want to take advantage of its LISTEN/NOTIFY support.

Another change from the previous incarnation is that the authentication system is now OpenID based, which suits me a great deal because I'm too lazy to code user management just for my blog (previously, I used a pass-through to the PHP CVS repository for auth).

I've restructured my URLs and implemented some fairly neat URL rewriting rules to make sure that the old links continue to work, even those for zlog.thebrainroom.net. I'll blog about that in another entry.

I've also decided to completely do away with HTML form based admin and blogging, because I'm always really frustrated by the editing interface. Instead, I'm using Microsoft Word 2007 (running under Parallels on my Macbook Pro) to post entries using its blogging capability.

Goodnight Star

We said goodnight to Star for the last time tonight.

She was 13 years old and had been suffering from liver cancer for the last few months. Star was daughter of Bronte and mother to Lily. Bronte is still going strong at 15 years old back in England, and Lily (6) is now our alpha dog.

I like to remember the clan Mac Bronte like this:

We will miss you Star.

Coding for coders: API and ABI considerations in an evolving code base

As you may know, we have an MTA product that is designed to be extended by people writing modules in C/C++, Java and Perl. To facilitate this, not only do we need to write the code for the product, but we also need to provide an API (Application Programming Interface) to our customers and partners so that they can build and run their modules.

There are a number of considerations when publishing an API:

Make the API easy to use

If the API is hard to understand then people will use it incorrectly, which might result in things blowing up in rare conditions that didn't come up in their testing. APIs tend to be hard to use if they have too many parameters or do too many things. It's a good idea to keep your API functions small and concise so that it's clear how they are supposed to work.

If you have a complex procedure with a number of steps, you should encapsulate those steps in another API function. This makes it easier to perform that procedure in the future.
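
For instance, if sending a report involves validating, formatting and flushing a session, wrap those steps in a single call rather than documenting the sequence. Here's a minimal sketch--the session/report names are hypothetical and not part of any real API:

   struct session;
   struct report;

   /* each step is still available individually... */
   int session_validate(struct session *s);
   int session_format(struct session *s, const struct report *r);
   int session_flush(struct session *s);

   /* ...but the documented entry point wraps the whole procedure,
    * so consumers can't get the ordering wrong */
   int session_send_report(struct session *s, const struct report *r) {
      int rc;
      if ((rc = session_validate(s)) != 0) {
         return rc;
      }
      if ((rc = session_format(s, r)) != 0) {
         return rc;
      }
      return session_flush(s);
   }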

Good documentation is a key component to ensuring that the APIs are used correctly; not only does it tell people how to use the API, it tells you how people are supposed to be using the API. More on that in a bit.

Don't change those APIs!

Once you've created an API and shipped your product and its gloriously detailed documentation, people will start to use it. There are two broad categories of people that will consume your API: customers that are building their own modules, and partners that build modules to sell to other people running the software. Any changes that you make to the API will require the first group to update their code, recompile and re-deploy. The latter group will need to do the same, and will also need to ship the updated modules to their customers.

This is a pain for both groups of people. If the API changes you make are extensive, someone there has to become familiar with those changes and figure out how to migrate their code from the old API to the new API in such a way that things still work. They may not have the resources to do this at the point where you release those changes, so you really need to avoid changing the API if you're shipping a critical bug fix.

ABI changes are bad too

ABI is an acronym for Application Binary Interface. It's similar to API, but the distinction is that API affects how you program against something, whereas ABI affects how the machine code expects things to work. If you're coming from a dynamic/scripting background, ABI doesn't really apply. Where it really matters is in cases where you're compiling your code and shipping the result. When you compile your code, the compiler figures out things like offsets of fields in structures, orders of parameters, sizes of structures and so forth, and encodes these things into the executable.

This is best illustrated with an example:

   struct foo {
      int a;
      int b;
   };
   int do_something(int param1, struct foo *foo);
   #define DOIT(a, b)   do_something(a, b)

Now, imagine that we ship another release where we've tweaked some code around:

   struct foo {
      int b;
      int a;
   };
   int do_something(struct foo *foo, int param1);
   #define DOIT(a, b)   do_something(b, a)

From an API perspective, things look the same (assuming that people only use the DOIT macro and not the do_something() function). If you don't rebuild the code, weird things will happen. For instance, the a and b fields in the foo structure have swapped places. That means that code compiled against the release 1 headers will be storing what it thinks is the value for a in the b slot. This can result in subtle to not-so-subtle misbehavior when the code is run, depending on what those functions do. The switch in the ordering of parameters to the do_something() function leads to similar problems.

These problems will vanish if the third party code is rebuilt against the new headers, but this requires that the updated code be re-deployed, and that may require additional resources, time and effort.

ABI changes are bad because they are not always immediately detected; the code will load and run until it either subtly corrupts memory or less subtly crashes because a pointer isn't where it used to be. The code paths that lead to these events may take some time to trigger.

In my contrived example above there was no reason to change the ordering of those things, and not changing them would have eliminated those problems.

Avoiding ABI and API breakage

A common technique for enhancing API calls is to do something like this:

   int do_this(int a);

and later:

   int do_this_ex(int a, int b);
   #define do_this(a)   do_this_ex(a, 0)

This neatly avoids an API change but breaks ABI: the do_this() function doesn't exist any more, so the program will break when that symbol is referenced. Depending on the platform, this might be when the module is loaded or at run time at the point where the function is about to be called for the first time.

If ABI is a concern for you, something like this is better:

   int do_this(int a) {
      return do_this_ex(a, 0);
   }

This creates a "physical" wrapper around the new API. You can keep the #define do_this() in your header file if you wish, and save an extra function call frame for people that are using the new API; people using the old ABI will still find that their linker is satisfied and that their code will continue to run.

Oh, and while I'm talking about making extended APIs, think ahead. If you think you're going to need an extra parameter in there one day, you can consider reserving it by doing something like this:

    int do_this(int a, int reserved);

and then documenting that reserved should always be 0. While that works, try to think a bit further ahead. Why might you need to extend that API? Will those projected changes require that additional APIs be added? If the answer is yes, then you shouldn't reserve parameters because what you'll end up with is code that does stuff like this:

   // I decided that I might add 4 parameters one day
   do_this(a, 0, 0, 0, 0);
   // but when that day arrived, I actually added a new function
   // that only needed 3
   do_this2(a, b, c);

Those reserved parameters add to your code complexity by making it harder to immediately grasp what's going on. What do those four zeros mean? Remember that one of the goals is to keep things simple.

You might have noticed that I called the new version of the API do_this2() instead of do_this_ex(). This also stems from thinking ahead. do_this_ex() is (by common convention) an extended form of do_this(), but what if I want to extend the extended version--do I call it do_this_ex_ex()? That sounds silly.

It's better to acknowledge API versioning as soon as you know that you need to do it. I'm currently leaning towards a numeric suffix like do_this2() for the second generation of the API and do_this3() for the third and so on.

Each time you do this, it's usually a good idea to implement the older versions of the APIs in terms of calls to the newer versions. This avoids code duplication which has a maintenance cost to you.

Of course, you'll make sure that you have unit tests that cover each of these APIs so that you can verify that they continue to work exactly as expected after you make your changes. At the very least, the unit tests should cover all the use cases in that wonderful documentation that you wrote--that way you know for sure that things will continue to work after you've made changes.
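
A minimal sketch of what such a test might look like, assuming the do_this()/do_this_ex() pair from earlier (the assertion is illustrative only):

   #include <assert.h>

   /* the documented contract is that the old API behaves exactly like
    * the new API called with the default value for the extra parameter */
   static void test_do_this_compat(void) {
      int a = 42;
      assert(do_this(a) == do_this_ex(a, 0));
   }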

Structures and ABI

I got a little side-tracked by talking about API function versioning. What about structures? I've already mentioned that changing the order of fields is "OK" from an API perspective but not from an ABI perspective. What about adding fields?

   struct foo {
      int a;
      int b;
   };

becoming:

   struct foo {
      int a;
      int b;
      int c;
   };

Whether this breaks ABI depends on how you intend people to use that structure. The following use case illustrates an ABI break:

   int main() {
      struct foo foo;
      int bar;
      do_something(&foo);
   }

Here, foo is declared on the stack, occupying 8 bytes in version 1 and 12 bytes (maybe more with padding, depending on your compiler flags) in version 2. On either side of foo on the stack are the stack frame and the bar variable. If we're running a program built against version 1 against version 2 libraries, the do_something() function will misbehave when it attempts to access the c field of the structure. If the usage is read-only, it will be reading "random" garbage from the stack--either something in the stack frame or perhaps even the contents of the bar variable, depending on the architecture and compilation flags. If it tries to update the c field then it will be poking into either the stack frame or the bar variable--stack corruption.

You can avoid this issue by using pointers rather than on-stack or global variables. There are two main techniques; the first builds ABI awareness into your APIs:

   struct foo {
      int size_of_foo;
      int a;
      int b;
   };
   int main() {
      struct foo foo;
      int bar;
      foo.size_of_foo = sizeof(foo);
      do_something(&foo);
   }

The convention here is to ensure that the first member of a structure is populated with its size. That way you can explicitly version your structures in your header files:

   struct foo_1 {
      int size_of_foo;
      int a;
      int b;
   };
   struct foo {
      int size_of_foo;
      int a;
      int b;
      int c;
   };
   int do_something(struct foo *foo) {
      if (foo->size_of_foo >= sizeof(struct foo)) {
         // we know that foo->c is safe to touch
      } else if (foo->size_of_foo == sizeof(struct foo_1)) {
         // "old style" foo, do something a bit different
      }
   }

Microsoft are rather fond of this technique. Another technique, which can be used in conjunction with the ABI-aware-API, is to encapsulate memory management. Rather than declare the structures on the stack, the API consumer works with pointers:

   int main() {
      struct foo *foo;
      int bar;
      foo = create_foo();
      foo->a = 1;
      foo->b = 2;
      do_something(foo);
      free_foo(foo);
   }

This approach ensures that all the instances of struct foo in the program are of the correct size in memory, so you won't run the risk of stack corruption. You'll need to ensure that create_foo() initializes the foo instance in such a way that the other API calls that consume it will treat it as a version 1 foo instance. Whether you do this by zeroing out the structure or building in ABI awareness is up to you.
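
A minimal sketch of such a constructor, assuming the ABI-aware variant of struct foo from above, with zero-initialization standing in for the version 1 defaults:

   #include <stdlib.h>

   struct foo *create_foo(void) {
      /* calloc() zeroes the fields, which here doubles as the
       * "version 1" default state */
      struct foo *foo = calloc(1, sizeof(*foo));
      if (foo != NULL) {
         foo->size_of_foo = sizeof(*foo);
      }
      return foo;
   }

   void free_foo(struct foo *foo) {
      free(foo);
   }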

Encapsulation

You can protect your API consumers from ABI breakage by providing a well encapsulated API. You do this by hiding the implementation of the structure and providing only accessor functions.

   struct foo; /* opaque, defined in a header file that you
                * don't ship to the customer */
   struct foo *create_foo();
   void free_foo(struct foo*);
   void foo_set_a(struct foo *, int value);
   int  foo_get_a(struct foo *);

By completely hiding the layout of the foo structure, the consumer's code is completely immune to changes in the layout of that structure, because it is forced to use the accessor APIs that you provided.
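
The accessor implementations live alongside the private definition of the structure, in a source file that you don't ship; a minimal sketch:

   /* private to the library--consumers only ever see the opaque
    * declaration and the accessor prototypes above */
   struct foo {
      int a;
      int b;
   };

   void foo_set_a(struct foo *foo, int value) {
      foo->a = value;
   }

   int foo_get_a(struct foo *foo) {
      return foo->a;
   }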

You can see a practical example of this in Solaris's ucred_get(3C) API.

Encapsulation has a trade-off though; if there are a lot of fields that you need to set in a structure, you might find that the aggregate cost of making function calls to get and set those values becomes significant. My usual disclaimer applies though--don't code it one way because you think it will run faster--do it after you've profiled the code and when you know that it will be faster. It's better to opt for maintainability first, otherwise you might as well be hand-coding in assembly language.

Summing up

It can be hard to retrofit API and ABI compatibility; it's best to plan for it early on, even if you just decide that you're not going to do it.

Projects typically adopt a strategy along the lines of: no ABI (and thus API) breaks in patchlevel releases. Avoid ABI (and thus API) breaks in minor releases. API will only break in major releases, after appropriate deprecation notices are published and a suitable grace period observed to facilitate migration.

Folks that are truly committed to API/ABI preservation will have a long deprecation period and will add an extra restriction--API/ABI changes will be removals only.

API/ABI preservation is a challenge, but if you get it right, your API consumers will love you for it.

I'll leave you with some bullet points:

  • Avoid changing APIs.
  • Avoid changing ABIs.
  • It's particularly important to preserve ABI compatibility if you're shipping a patch level release, because people tend to put less effort into QA and might overlook a breakage.
  • If you need to expand, spawn a new generation of APIs rather than mutating existing ones.
  • If you need to expand structures, don't change the ordering of fields; add new fields to the end.
  • Encapsulate structures with APIs if you can.
  • Unit tests are essential.
  • Documentation is very important.

Help build a public UK postcode database

Via BoingBoing:

New Public Edition maps is a project that is trying to create a freely usable UK postcode database. The British Post Office owns the database of postcodes and their corresponding coordinates. That means that your website can only use postcodes if you buy a license from the Post Office.

New Public Edition (along with a similar project, Free the Postcode) is trying to solve this. They have 1950s-era public-domain maps and they ask you to locate your house (or childhood home) on them and key in your postcode. They do the rest, eventually building out a complete database of every postcode in Britain.

The resulting data will be released as purely public domain--no restrictions whatsoever on re-use.

I just filled in a couple of postcodes from previous residences, and it was quite interesting to see how the area that I grew up in has changed since 1950; it looks like it used to be one large farm that was broken up into a couple of smaller farms that have now become residential areas. It's a logical progression really, but having a date like 1950 gives a sense of dimension--it's easy to think that that change happened "hundreds of years ago", but it's much more recent than that.

So, if you're in the UK, or lived there for a while, please take a couple of minutes to visit New Public Edition, fill in your postcode, and perhaps gain a better understanding of the places you've lived.