Jyte.com: digg crossed with social networking

http://jyte.com/ is kinda fun: you sign in with your OpenID, make claims about things (it doesn't matter what) and vote on those claims. You can also give credit to other people, join and create groups, and record contacts and relationships.

Sounds pretty standard, but what I find interesting is that there is a set of APIs for querying group membership. This could potentially allow an application to restrict access based on membership of an invitation-only group defined on jyte: only people who are members of that group would be allowed through.

Another interesting area is the upcoming credit API; it will be possible to query the overall credit score of an identity. I'm wondering whether a high enough credit score will be acceptable as "proof enough" that a user is not a bot, so that we could then remove the captcha step for those users.

It will be interesting to see where these concepts go as the year progresses.

OpenID (and TypeKey) using native OpenSSL functions in PHP

Update: fixed a flaw in my implementation

I may have hinted at this a couple of times before, but now I'm actually saying something useful about it... I have a patch (php-openid.diff, for PHP 5, might also apply to PHP 4) for the openssl extension that makes it easier to build OpenID and TypeKey authentication support into your PHP apps.

I don't have a canned solution for you to deploy, but I can give you some pointers on how to use these bits. I'm assuming that you know a bit about how OpenID works.

This worked for me in my tests; it's not necessarily the optimal way to do it, but it highlights how it works.

Thanks to the folks at JanRain, a flaw in my implementation has now been fixed.

Associate

Association allows you to establish a relationship with an OpenID server by generating and exchanging keys. It has nothing to do with an authentication request per se; the result of the association can be used later to authenticate a user (and other users of that same identity server). The results of the association should be cached.

If you haven't already associated with an OpenID server, you'll want to do something like the following:

<?php
  // convert openssl unsigned big endian into signed two's complement
  // notation
  function btwoc($str) {
    if (ord($str[0]) > 127) {
      return "\x00" . $str;
    }
    return $str;
  }
  $assoc = array();
  $crypto = array();
  $dh = openssl_dh_generate_key(OPENID_P_VALUE, '2');
  foreach (openssl_dh_get_params($dh) as $n => $v) {
    $crypto[$n] = openssl_bignum_to_string($v, 10);
  }
  $params = array(
     'openid.mode' => 'associate',
     'openid.assoc_type' => 'HMAC-SHA1',
     'openid.session_type' => 'DH-SHA1',
     'openid.dh_modulus' => base64_encode(
                btwoc(openssl_bignum_to_string(OPENID_P_VALUE))),
     'openid.dh_gen' => base64_encode(
                btwoc(openssl_bignum_to_string('2'))),
     'openid.dh_consumer_public' => base64_encode(btwoc(
                openssl_bignum_to_string($crypto['pub_key']))),
  );
  $r = perform_openid_rpc($server, $params); // talk to server
  if ($r['session_type'] == 'DH-SHA1') {
    $s_pub = openssl_bignum_from_bin(
               base64_decode($r['dh_server_public']));
    $dh_sec = openssl_dh_compute_key($dh, $s_pub);
    if ($dh_sec === false) {
      do {
        $err = openssl_error_string();
        if ($err === false) {
          break;
        }
        echo "$err<br>\\n";
      } while (true);
    }
    $sh_sec = sha1($dh_sec, true);
    $enc_mac = base64_decode($r['enc_mac_key']);
    $secret = $enc_mac ^ $sh_sec;
    $assoc['secret'] = $secret;
    $assoc['handle']  = $r['assoc_handle'];
    $assoc['assoc_type'] = $r['assoc_type'];
    $assoc['expires'] = time() + $r['expires_in'];
  } else {
    $assoc = false;
  }
?>
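
The perform_openid_rpc() helper used above isn't part of the patch; its job is just to POST the openid.* parameters to the server and parse the key:value lines that come back into an array. Here's a rough sketch of the idea, using file_get_contents() with a stream context (cURL would do just as well); the function name matches the calls above, but treat the rest as a starting point rather than production code:

<?php
  // Sketch of perform_openid_rpc(): POST the parameters to the OpenID
  // server and parse the "key:value" response lines into an array.
  function perform_openid_rpc($server, $params) {
    $ctx = stream_context_create(array(
      'http' => array(
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query($params),
      ),
    ));
    $raw = file_get_contents($server, false, $ctx);
    if ($raw === false) {
      return false;
    }
    $res = array();
    foreach (explode("\n", $raw) as $line) {
      if (strpos($line, ':') === false) {
        continue;
      }
      list($k, $v) = explode(':', $line, 2);
      $res[trim($k)] = trim($v);
    }
    return $res;
  }
?>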

Performing Authentication

Authentication is browser based; the user enters their URL into your site, and you then redirect them to their OpenID server with a sprinkle of magic sauce in the GET parameters. Here's how you create the sauce:

<?php
  // $identifier is the URL they gave to you
  // $server is the server you discovered
  // $delegate is the identity you discovered
  // $returnURL is your auth endpoint to receive the results
  $x = parse_url($server);
  $params = array();
  if (isset($x['query'])) {
    foreach (explode('&', $x['query']) as $param) {
      list($k, $v) = explode('=', $param, 2);
      $params[urldecode($k)] = urldecode($v);
    }
  }
  // get assoc details from cache, or associate now.
  $assoc = $this->associate($server);
  $params['openid.mode'] = 'checkid_immediate';
  $params['openid.identity'] = $delegate;
  $params['openid.return_to'] = $returnURL;
  $params['openid.trust_root'] = YOUR_TRUST_ROOT_URL;
  $params['openid.sreg.required'] = 'nickname,email';
  if ($assoc !== false) {
    $params['openid.assoc_handle'] = $assoc['handle'];
  }
  $x['query'] = http_build_query($params);
  // you can now assemble $x into a URL and redirect the user there
?>
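
The comments at the top of that block assume you've already discovered $server and $delegate from the URL the user typed in. For OpenID 1.1 that means fetching the page and looking for the openid.server and openid.delegate <link> tags. A deliberately simple-minded sketch of that step (a real consumer should be far more tolerant of attribute ordering and markup quirks than this regex is):

<?php
  // Sketch of HTML-based OpenID 1.1 discovery: returns the server
  // endpoint and the delegate identity (falling back to the
  // identifier itself if no delegate is declared).
  function discover_openid($identifier) {
    $html = file_get_contents($identifier);
    if ($html === false) {
      return false;
    }
    $server = null;
    $delegate = $identifier;
    if (preg_match('@<link[^>]+rel=["\']openid\.server["\'][^>]+href=["\']([^"\']+)["\']@i',
        $html, $m)) {
      $server = $m[1];
    }
    if (preg_match('@<link[^>]+rel=["\']openid\.delegate["\'][^>]+href=["\']([^"\']+)["\']@i',
        $html, $m)) {
      $delegate = $m[1];
    }
    return $server === null ? false
                            : array('server' => $server, 'delegate' => $delegate);
  }
?>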

Once the user has authenticated against their ID server, they'll be redirected back to your $returnURL:

<?php
    $assoc = $this->associate($args['srv']);
    $token_contents = '';
    /* note well: the name in the token_contents hash is the
     * name without any prefix.
     * This nuance can keep you occupied for hours. */
    foreach (explode(',', $_GET['openid_signed']) as $name) {
      $token_contents .= "$name:" . $_GET["openid_" . str_replace('.', '_', $name)] . "\n";
    }
    $x = hash_hmac('sha1', $token_contents, $assoc['secret'], true);
    $hash = base64_encode($x);
    if ($hash === $_GET['openid_sig']) {
      // Authenticated
      return true;
    }
    /* not valid for whatever reason; we need to do a dumb mode check */
    $params = array();
    $signed = explode(',', $_GET['openid_signed']);
    $signed = array_merge($signed, array('assoc_handle', 'sig', 'signed', 'invalidate_handle'));
    foreach ($signed as $name) {
      $k = "openid_" . str_replace('.', '_', $name);
      if (array_key_exists($k, $_GET)) {
        $params["openid.$name"] = $_GET[$k];
      }
    }
    $server = $args['srv'];
    /* broken spec.  You need to set openid.mode to
     * check_authentication to get it to do the auth checks.
     * But, it needs openid.mode to be id_res for the signature to work.
     */
    $params['openid.mode'] = 'check_authentication';
    $res = perform_openid_rpc($server, $params);
    if (isset($res['invalidate_handle'])) {
      if ($res['invalidate_handle'] === $assoc['handle']) {
        /* remove association */
        $this->associate($server, true);
      }
    }
    return $res['is_valid'] === 'true';
?>

Didn't he also mention TypeKey?

Yeah, here's how to validate the signature you get when your user is redirected back from TypeKey:

<?php
    $keydata = array();
    $regkeys = cache::httpGet('http://www.typekey.com/extras/regkeys.txt', 24*60*60);
    if ($regkeys === false) {
       die("urgh");
    }
    foreach (explode(' ', $regkeys) as $pair) {
      list($k, $v) = explode('=', trim($pair));
      $keydata[$k] = $v;
    }
    $sig = str_replace(' ', '+', $_GET['sig']);
    $email = $_GET['email'];
    $name = $_GET['name'];
    $nick = $_GET['nick'];
    $ts = $_GET['ts'];
    $msg = "$email::$name::$nick::$ts::" . TYPEKEY_TOKEN;
    if (time() - $ts > 300) {
      die("possible replay");
    }
    list($r_sig, $s_sig) = explode(':', $sig, 2);
    $r_sig = base64_decode($r_sig);
    $s_sig = base64_decode($s_sig);
    $valid = openssl_dsa_verify(sha1($msg, true),
                                openssl_bignum_from_bin($r_sig),
                                openssl_bignum_from_bin($s_sig),
                                $keydata['p'], $keydata['q'],
                                $keydata['g'], $keydata['pub_key']);
?>
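
cache::httpGet() above is just a convenience wrapper of my own: fetch a URL, but keep a local copy around for the given number of seconds so that you don't hit typekey.com on every single login. It's not part of the patch; something along these lines will do (the class name matches the call above, everything else is up to you):

<?php
  // Sketch of a caching HTTP GET: re-use a locally cached copy of the
  // URL if it is younger than $ttl seconds.
  class cache {
    static function httpGet($url, $ttl) {
      $file = '/tmp/httpcache.' . md5($url);
      if (file_exists($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
      }
      $body = file_get_contents($url);
      if ($body === false) {
        return false;
      }
      file_put_contents($file, $body);
      return $body;
    }
  }
?>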

Evildesk 0.9.0 released

I uploaded release 0.9.0 of EvilDesk tonight. I realized that I hadn't made a release in over a year, so I tidied up a few bits and pieces and uploaded it. Feel free to review the changelog if you're curious.

Highlights include an improved dock-style toolbar, a launcher plugin (type the name of a program or document to find it and run it, instead of poking around the start menu), simpler configuration of the toolbar positioning, translations for German and French, fewer bugs and support for 64-bit Windows.

Enjoy!

2006 from Wez's perspective

Here's 2006 from my point of view. I did tinker a bit with the appearance of my blog, but that's not really a high point of the year for me. My year went a bit like this:

I started the year with a couple of EvilDesk releases, which in turn generated some snarky feedback from a couple of people in the PHP community, cooling my drive for working on PHP. Work and family pressures didn't help to restore my earlier level of PHP activity, and to be honest, some words and actions in the PHP community over the year didn't really help either.

MessageLabs chose us to provide their MTA infrastructure, which saw me back in the UK a couple of times at the start of the year while we worked together to plan part of the architecture.

I ported Solaris' memory manager (umem) to the "other" platforms (linux, windows and bsdish systems).

My older brother suffered a set-back this year, and I wish I could have helped him out a lot more than I did.

I started to test-drive Google Calendar, but that petered out because I can't put company confidential information in there. It's a shame, because it works well.

I finally got some of my writings published in a book, although not an entire book of my own.

I spoke at MySQLUC, OSCON, php|works and zendcon and attended the first MS web dev summit. Memorable moments around these events include being stuck in the seedier part of Phoenix for a night on the way to MySQLUC (not a fond memory to be sure!), awful karaoke at the Sun party @OSCON, excellent home-made sushi and Xbox 360 on a 120" screen at a friend's home in Seattle on the way back from OSCON, a really good British-style pub in Toronto during php|works, reasonable karaoke at zendcon, Andrei's birthday party after zendcon and meeting Don Box at the MS web dev summit.

Our 4 year old son Xander started at pre-school and is doing well.

The death of my faithful Toshiba M30 saw me adopt an Intel-based MacBook, partly for the native unix environment and partly to force me to learn about the oddities of its runtime linker. I still have complaints about the way certain things work, but on the whole I am a happy user, made happier by Parallels because I still need Windows-based software.

OmniTI has grown a decent amount this year; we moved premises and now have two business units--OmniTI, the computer consulting company, and Message Systems, the messaging company. Both arms of the company have done well this year and are set to continue doing so in the future. Work continues to be fun, interesting and challenging, with great people on staff (that's all of them, not just the "internet celebrities").

I've been the architect of several large pieces of infrastructure for Ecelerity this year, one of which is in the realm of meta-meta programming (gives your brain a workout, guaranteed!). I'm looking forward to seeing the fruits of this labor in 2007, and to getting around to work on some more of the juicy ideas we continue to have for expanding and improving things.

What about PHP? I've been working on a unicode enabled version of PDO for the preview release of PHP 6. This should be completed soon, and I look forward to continued improvements in the PDO drivers and, in particular, the OCI driver which is long overdue some TLC. I've also been toying with something OSX specific for PHP that just isn't close to ready yet; maybe that will be something I can share in the first quarter of 2007.

Here's to a prosperous 2007!

Coding for source control

Hot on the heels of my Coding for Coders entry (focused on C), here's another on coding for source control.

When you have a large code base in a source control system (like subversion), you'll find that things go easier if you adopt a few coding practices that work in-hand with the way that the version control works.

Embrace branches and tags

You really should investigate how to use the branching and tagging feature in your source control system. A typical practice is to do development in trunk and have a branch for each major version of the code (eg: a 1.0 branch, 2.0 branch and so on), tagging that branch each time you reach a significant point in development and each time you ship code. Depending on your project, you might branch for minor versions too (eg: 1.2, 1.3).
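
With subversion, for example, the conventional repository layout (the names here are just illustrative) looks something like this:

      trunk/             # on-going development
      branches/1.0/      # maintenance branch for the 1.0 series
      branches/2.0/
      tags/1.0.0/        # read-only snapshot of each shipped release
      tags/1.0.1/
      tags/2.0.0/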

Think in terms of changesets

If you're working on a bug fix or implementing a feature, it's good practice to distill the net development effort down to a single patch for the tree. The set of changes in that patch is the changeset that implements the bug fix or feature.

Once you have the changeset, you can look at applying it to one of your branches so that you can ship the fixed/enhanced product.

Trivial fixes can usually be implemented with a single commit to the repository, but more complex changesets might span a number of commits. It's important to track the commits so that your changeset is easier to produce.

We use trac for our development ticket tracking. It's easy to configure trac/subversion to add a commit hook that allows developers to reference a ticket in their commit messages and then have all the commits related to that ticket show up as comments when viewing the ticket. You can then merge each commit into your working copy and then check in the resulting changeset.
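
With a hook like that in place, a commit message might read something like the following (the exact keywords recognized depend on how the hook is configured; the ticket number is made up):

      Fix queue manager buffer sizing on 64-bit builds.
      Refs #123.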

If one or more of your developers are making extensive changes, it's a good idea for them to do their work in their own branches. That way they won't step on each other's toes during development. You might also want to look at creating a branch per big ticket--this will allow you to exploit the diffing/merging features of your source control system to keep track of the overall changeset.

Code with merging in mind

When you're making code changes, try to think ahead to how the patch will look, and how easy it will be for your source control system to manage merging that code.

A few suggestions:

  • if you have a list of things to update, break the list up so that each item has its own line.
  • if the list has a separator character (eg: a comma), include the separator on the last line of the list.
  • if you're adding to a list, add to the end if possible.
  • avoid changing whitespace, try to have your patch reflect functional changes only.

Your goal is to minimize the patch so that it represents the smallest possible set of changed lines. If you can avoid touching peripheral lines around your changeset, you reduce the risk of running into conflicts when you merge.

Get into the habit of diffing your changes against the repository while you work, and certainly always diff before you commit. If you find changed lines that are not essential for the patch (whitespace in particular), take them out!

Here's an example from a makefile:

      SOURCES = one.c two.c three.c

This is nice and readable at first, but over time this line may grow to include a large number of source files. People will tend to add to the end at first, and perhaps alphabetically when the number of files increases. The resulting diff shows a single modified line but won't really show you what changed on that line. Things get difficult when two changesets affect that line; you'll get a conflict because the source control system doesn't know how to merge them.

      # this is better
      SOURCES = \
        one.c \
        two.c \
        three.c \

Each item now has its own line. By getting into the habit of adding at the end, complete with separator or continuation character, you help the merge process: each item you add will be a single-line diff, and the source control system will know that you're adding it at the end, improving the chances of a successful merge a great deal.

Adding at the end isn't the golden rule so much as making sure that everyone adds consistently. Often, order is important, so adding at the end isn't going to help you. By adding in a consistent manner, you reduce the chances of touching the same lines as another changeset and thus reduce the chances of a conflict.

Here's the same example, but in PHP:

      $foo = array("one", "two", "three");

better:

      $foo = array(
              "one",
              "two",
              "three",
             );

Dangling commas are good! :)

Keep the diff readable

Don't take the concept of small diffs too literally--if you can express your change on a single line that is 1024 characters long you've made the merge easier at the expense of making it really hard to review what the change does. This basically boils down to making sure that you stick to the coding standards that have been established for the project.

Don't sacrifice human readability for the sake of easier merging.

If you find that you need to merge a changeset to more than one branch (say you have a bug fix to apply to 2.0 and 2.0.1) then it's often easier to merge to 2.0 first, resolve any conflicts, commit, and then merge the 2.0 changeset into 2.0.1, rather than merging the trunk changeset directly to 2.0.1.

These practices aren't obtrusive and will help you when you need to merge a changeset from one branch to another.

I don't pretend to know everything, these are just a couple of tidbits I thought I'd share. If you have other similar advice, I'd like to hear it--feel free to post a comment.

Coding for coders: API and ABI considerations in an evolving code base

As you may know, we have an MTA product that is designed to be extended by people writing modules in C/C++, Java and Perl. To facilitate this, not only do we need to write the code for the product, but we also need to provide an API (Application Programming Interface) to our customers and partners so that they can build and run their modules.

There are a number of considerations when publishing an API:

Make the API easy to use

If the API is hard to understand then people will use it incorrectly, which might result in things blowing up in rare conditions that didn't come up in their testing. APIs tend to be hard to use if they have too many parameters or do too many things. It's a good idea to keep your API functions small and concise so that it's clear how they are supposed to work.

If you have a complex procedure with a number of steps, you should encapsulate those steps in another API function. This makes it easier to perform that procedure in the future.

Good documentation is a key component to ensuring that the APIs are used correctly; not only does it tell people how to use the API, it tells you how people are supposed to be using the API. More on that in a bit.

Don't change those APIs!

Once you've created an API and shipped your product and its gloriously detailed documentation, people will start to use it. There are two broad categories of people that will consume your API: customers that are building their own modules, and partners that build modules to sell to other people running the software. Any changes that you make to the API will require the first group to update their code, recompile and re-deploy. The latter group will need to do the same, but will also need to ship the updated modules to their customers.

This is a pain for both groups of people. If the API changes you make are extensive, someone on their side has to become familiar with those changes and figure out how to migrate their code from the old API to the new API in such a way that things still work. They may not have the resources to do this at the point where you release those changes, so you really need to avoid changing the API if you're shipping a critical bug fix.

ABI changes are bad too

ABI is an acronym for Application Binary Interface. It's similar to API, but the distinction is that API affects how you program against something, whereas ABI affects how the machine code expects things to work. If you're coming from a dynamic/scripting background, ABI doesn't really apply. Where it really matters is in cases where you're compiling your code and shipping the result. When you compile your code, the compiler figures out things like offsets of fields in structures, orders of parameters, sizes of structures and so forth, and encodes these things into the executable.

This is best illustrated with an example:

   struct foo {
      int a;
      int b;
   };
   int do_something(int param1, struct foo *foo);
   #define DOIT(a, b)   do_something(a, b)

Now, imagine that we ship another release where we've tweaked some code around:

   struct foo {
      int b;
      int a;
   };
   int do_something(struct foo *foo, int param1);
   #define DOIT(a, b)   do_something(b, a)

From an API perspective, things look the same (assuming that people only use the DOIT macro and not the do_something() function). If you don't rebuild the code, weird things will happen. For instance, the a and b fields in the foo structure have swapped places. That means that code compiled against the release 1 headers will be storing what it thinks is the value for a in the b slot. This can result in subtle to not-so-subtle behavior when the code is run, depending on what those functions do. The switch in the ordering of parameters to the do_something() function leads to similar problems.

These problems will vanish if the third party code is rebuilt against the new headers, but this requires that the updated code be re-deployed, and that may require additional resources, time and effort.

ABI changes are bad because they are not always immediately detected; the code will load and run until it either subtly corrupts memory or less subtly crashes because a pointer isn't where it used to be. The code paths that lead to these events may take some time to trigger.

In my contrived example above there was no reason to change the ordering of those things, and not changing them would have eliminated those problems.

Avoiding ABI and API breakage

A common technique for enhancing API calls is to do something like this:

   int do_this(int a);

and later:

   int do_this_ex(int a, int b);
   #define do_this(a)   do_this_ex(a, 0)

This neatly avoids an API change but breaks ABI: the do_this() function doesn't exist any more, so the program will break when that symbol is referenced. Depending on the platform, this might be at compile time or it might be at run time at the point where the function is about to be called for the first time.

If ABI is a concern for you, something like this is better:

   int do_this(int a) {
      return do_this_ex(a, 0);
   }

This creates a "physical" wrapper around the new API. You can keep the #define do_this() in your header file if you wish, and save an extra function call frame for people who are using the new API; people using the old ABI will still find that their linker is satisfied and that their code will continue to run.

Oh, and while I'm talking about making extended APIs, think ahead. If you think you're going to need an extra parameter in there one day, you can consider reserving it by doing something like this:

    int do_this(int a, int reserved);

and then documenting that reserved should always be 0. While that works, try to think a bit further ahead. Why might you need to extend that API? Will those projected changes require that additional APIs be added? If the answer is yes, then you shouldn't reserve parameters because what you'll end up with is code that does stuff like this:

   // I decided that I might add 4 parameters one day
   do_this(a, 0, 0, 0, 0);
   // but when that day arrived, I actually added a new function
   // that only needed 3
   do_this2(a, b, c);

Those reserved parameters add to your code complexity by making it harder to immediately grasp what's going on. What do those four zeros mean? Remember that one of the goals is to keep things simple.

You might have noticed that I called the new version of the API do_this2() instead of do_this_ex(). This also stems from thinking ahead. do_this_ex() is (by common convention) an extended form of do_this(), but what if I want to extend the extended version--do I call it do_this_ex_ex()? That sounds silly.

It's better to acknowledge API versioning as soon as you know that you need to do it. I'm currently leaning towards a numeric suffix like do_this2() for the second generation of the API and do_this3() for the third and so on.

Each time you do this, it's usually a good idea to implement the older versions of the APIs in terms of calls to the newer versions. This avoids code duplication which has a maintenance cost to you.

Of course, you'll make sure that you have unit tests that cover each of these APIs so that you can verify that they continue to work exactly as expected after you make your changes. At the very least, the unit tests should cover all the use cases in that wonderful documentation that you wrote--that way you know for sure that things will continue to work after you've made changes.

Structures and ABI

I got a little side-tracked by talking about API function versioning. What about structures? I've already mentioned that changing the order of fields is "OK" from an API perspective but not from an ABI perspective. What about adding fields?

   struct foo {
      int a;
      int b;
   };

becoming:

   struct foo {
      int a;
      int b;
      int c;
   };

Whether this breaks ABI depends on how you intend people to use that structure. The following use case illustrates an ABI break:

   int main() {
      struct foo foo;
      int bar;
      do_something(&foo);
   }

Here, foo is declared on the stack, occupying 8 bytes in version 1 and 12 bytes (maybe more with padding, depending on your compiler flags) in version 2. Either side of foo on the stack are the stack frame and the bar variable. If we're running a program built against version 1 against version 2 libraries the do_something() function will misbehave when it attempts to access the c field of the structure. If the usage is read-only it will be reading "random" garbage from the stack--either something in the stack frame or perhaps even the contents of the bar variable, depending on the architecture and compilation flags. If it tries to update the c field then it will be poking into either the stack frame or the bar variable--stack corruption.

You can avoid this issue by using pointers rather than on-stack or global variables. There are two main techniques; the first builds ABI awareness into your APIs:

   struct foo {
      int size_of_foo;
      int a;
      int b;
   };
   int main() {
      struct foo foo;
      int bar;
      foo.size_of_foo = sizeof(foo);
      do_something(&foo);
   }

The convention here is to ensure that the first member of a structure is populated with its size. That way you can explicitly version your structures in your header files:

   struct foo_1 {
      int size_of_foo;
      int a;
      int b;
   };
   struct foo {
      int size_of_foo;
      int a;
      int b;
      int c;
   };
   int do_something(struct foo *foo) {
      if (foo->size_of_foo >= sizeof(struct foo)) {
         // we know that foo->c is safe to touch
      } else if (foo->size_of_foo == sizeof(struct foo_1)) {
         // "old style" foo, do something a bit different
      }
   }

Microsoft are rather fond of this technique. Another technique, which can be used in conjunction with the ABI-aware-API, is to encapsulate memory management. Rather than declare the structures on the stack, the API consumer works with pointers:

   int main() {
      struct foo *foo;
      int bar;
      foo = create_foo();
      foo->a = 1;
      foo->b = 2;
      do_something(foo);
      free_foo(foo);
   }

This approach ensures that all the instances of struct foo in the program are of the correct size in memory, so you won't run the risk of stack corruption. You'll need to ensure that create_foo() initializes the foo instance in such a way that the other API calls that consume it will treat it as a version 1 foo instance. Whether you do this by zeroing out the structure or building in ABI awareness is up to you.

Encapsulation

You can protect your API consumers from ABI breakage by providing a well encapsulated API. You do this by hiding the implementation of the structure and providing only accessor functions.

   struct foo; /* opaque, defined in a header file that you
                * don't ship to the customer */
   struct foo *create_foo();
   void free_foo(struct foo*);
   void foo_set_a(struct foo *, int value);
   int  foo_get_a(struct foo *);

By completely hiding the layout of the foo structure, the consumer's code is completely immune to changes in the layout of that structure, because it is forced to use the accessor APIs that you provided.

You can see a practical example of this in Solaris's ucred_get(3C) API.

Encapsulation has a trade-off though; if there are a lot of fields that you need to set in a structure, you might find that the aggregate cost of making function calls to get and set those values becomes significant. My usual disclaimer applies though--don't code it one way because you think it will run faster--do it after you've profiled the code and when you know that it will be faster. It's better to opt for maintainability first, otherwise you might as well be hand-coding in assembly language.

Summing up

It can be hard to retrofit API and ABI compatibility; it's best to plan for it early on, even if you just decide that you're not going to do it.

Projects typically adopt a strategy along the lines of: no ABI (and thus API) breaks in patchlevel releases. Avoid ABI (and thus API) breaks in minor releases. API will only break in major releases, after appropriate deprecation notices are published and a suitable grace period observed to facilitate migration.

Folks that are truly committed to API/ABI preservation will have a long deprecation period and will add an extra restriction--API/ABI changes will be removals only.

API/ABI preservation is a challenge, but if you get it right, your API consumers will love you for it.

I'll leave you with some bullet points:

  • Avoid changing APIs.
  • Avoid changing ABIs.
  • It's particularly important to preserve ABI compatibility if you're shipping a patch level release, because people tend to put less effort into QA and might overlook a breakage.
  • If you need to expand, spawn a new generation of APIs rather than mutating existing ones.
  • If you need to expand structures, don't change the ordering of fields; add new fields at the end.
  • Encapsulate structures with APIs if you can.
  • Unit tests are essential.
  • Documentation is very important.

parser and lexer generators for PHP

[Update: I've put these parser/lexer tools on BitBucket and Github; enjoy!]

From time to time, I find that I need to put a parser together. Most of the time I find that I need to do this in C for performance, but other times I just want something convenient, like PHP, and have been out of luck.

This Thanksgiving I set out to remedy this and adapted lemon to optionally emit PHP code, and likewise with JLex.

You need a C compiler to build lemon and a java compiler and runtime to build and run JLexPHP, but after having translated your .y and .lex files with these tools, you're left with a pure PHP parser and lexer implementation.

The parser and lexer generators are available under a BSDish license, from both BitBucket and Github.

See enclosed README files for more information.

Help build a public UK postcode database

Via BoingBoing:

New Public Edition maps is a project that's trying to create a freely usable UK postcode database. The British Post Office owns the database of postcodes and their corresponding coordinates. That means that your website can only use postcodes if you buy a license from the Post Office.

New Public Edition (along with a similar project, Free the Postcode) is trying to solve this. They have 1950s-era public-domain maps, and they ask you to locate your house (or childhood home) on them and key in your postcode. They do the rest, eventually building out a complete database of every postcode in Britain.

The resulting data will be released as purely public domain--no restrictions whatsoever on re-use.

I just filled in a couple of postcodes from previous residences, and it was quite interesting to see how the area that I grew up in has changed since 1950; it looks like it used to be one large farm that was broken up into a couple of smaller farms that have now become residential areas. It's a logical progression really, but having a date like 1950 gives a sense of dimension--it's easy to think that that change happened "hundreds of years ago", but it's much more recent than that.

So, if you're in the UK, or lived there for a while, please take a couple of minutes to visit New Public Edition, fill in your postcode, and perhaps gain a better understanding of the places you've lived.