Peter Hicks' Blog – Page 6 – The personal blog of Peter Hicks

Crunching rail timetables

For those of you new to this blog, I’ve been doing some work with timetable data for a few months now, and I presented my work at OpenTech with Jonathan Raper earlier this year. I’m working with some other people to bring more information about the rail network out from behind the scenes and into the hands of the public so people can innovate and analyse the data, and ultimately to increase transparency and accountability. Importantly, I am also pro-rail and looking to improve on what we have.

So – it’s taken a while, but TSDB Explorer can now load an entire ~500MB CIF format timetable in around an hour on an average machine. Whilst I can undoubtedly improve this, it’s a lot better than the three-day, multi-gigabyte monstrosity I wrote previously.

Several people are interested in the format of the CIF file, and I’m going to put a set of slides together soon to explain it. Hopefully David Cameron’s recent letter on open data will help make Network Rail-sourced CIF timetable data more widely available, and my “How To” guide will lower the barrier for other people to write timetable analysers, produce train frequency graphs, generate pocket timetables, etc.

Watch this space – these are very exciting times.

[Test|Behaviour] Driven Development

Whichever option you choose, I am a convert.
Previous projects of mine were met with a “*sigh* I suppose I need to write some tests, but I want to get on with the code”. TSDB Explorer is the first codebase I’ve written where the majority of code existed only after the unit tests. I have to say, it’s many times more robust, and I’m going to continue the trend in TubeHorus.
Part of me is delighted too at rcov and its “Here’s how much of your code is covered by tests”. There’s a definite warm feeling to be had when you’ve covered nearly all your code with tests, although as @bobtfish said – “That just shows you that your bugs are within your unit tests”. But I can always write more of them to make sure I don’t re-introduce bugs.

Making a bad situation worse

Last night’s attempted cable theft at Woking wasn’t a pleasant experience for the thousands of people trying to get home. An earlier signalling problem at Clapham Junction disrupted my journey out to Putney slightly, but it was utter chaos later.

My journey back home would have been a nightmare had it not been for the exceedingly convenient London Overground service from Clapham to Stratford. My only criticism is that the 2015 departure from Clapham gets to Gospel Oak at the same time as the 2050 service to Barking departs. A minor problem though.

Some hours later, after I’d returned home and had dinner, a friend of mine called me up for advice on which trains to get back to Winchester. He’d been trying to get back from Waterloo and had been advised to travel via Reading. Thanks to one of my Open Source projects, TSDB Explorer, I could tell him which trains to get and from where – but not if they were running or where they were.

Hearing the story in the news this morning, my jaw dropped when I heard that some passengers forced open train doors and made a run for it down the track. That’s an exceedingly bad thing to do, for a number of reasons:

First and foremost, there’s the danger of electrocution from the conductor rail – Module DC of the Rule Book sets out the details for the technically minded. Suffice it to say that if you step on, or slip over on to, the conductor rail, you’re not coming out of it unscathed.
Second, once the driver of a train receives an alert on the train’s management system that the emergency egress handle has been used on his train, he’s going to call the electrical control room and/or signaller immediately and get the power switched off, or ‘isolated’. In an emergency, this can only be done for a large area, because you don’t have time to work out which parts of the supply to turn off (and sometimes you just don’t have the option – imagine trying to switch off just one socket on a ring main from the consumer unit in your house). The lack of power, and the knowledge that there are people on the track, further screws up any attempt by Network Rail and South West Trains to get trains moving, however slowly. Even if the attempted cable theft affected two out of four lines, there are still procedures that can be undertaken to move trains without the aid of normal signalling systems – they’re slow, but they exist, and they are safe. So, the result of people ‘escaping’ from trains through frustration? More trains not moving for a long time because there’s no power to any of them. Oh, and without power, the air conditioning on trains won’t work. South West Trains’ fleet doesn’t have windows that can be opened – there’s no point with air conditioning – so everyone else gets warm and agitated.
Finally, trespassing on the track is just that – trespass.

So, the moral of the story? However frustrated you are, don’t take matters into your own hands and turn a difficult but manageable situation into a potentially serious incident involving death.

Open Source, Open Data

I’ve had a rethink about source code hosting. CVS is dead in the water, Subversion requires online connectivity, and I’m starting to use git with vigour. Hey, offline commits are perfect for coding on the train! (As an aside, I gave up trying to get WiFi access on a train to Leicester on Saturday, and didn’t even bother trying on Sunday coming back.) GitHub is where it’s at – although despite today being World IPv6 Day, it doesn’t appear to be natively reachable over IPv6.

The code for TSDB Explorer is up and out there and being actively worked on, as is TubeHorus, which is in a lesser working state. I anticipate getting around to putting TransportHacker’s code on GitHub in the next week or so.

On another note, I’d like to thank the people at Network Rail who’ve been so helpful in talking to me about some of the data sets they hold. Whilst I’m not in a position to let the cat out of the bag yet, I am pretty excited about what’s coming in the next few weeks. Time to investigate Amazon EC2 I think… this may take some horsepower.

Open Rail Data

Jonathan Raper and I gave presentations on Open Rail Data – Jonathan from a more political angle, and me from a decidedly technical angle.

The material went down really well – there’s plenty of scope for us to show what can be done if timetable, real-time running and fares data are made openly available. I thoroughly enjoyed delivering the presentation – I haven’t done that since Berlin in 2006, and I’d forgotten how easily I slip into “presenter mode”.

Here is a copy of my OpenTech 2011 presentation in PDF format if you’re interested. Or, if you simply want to get in touch, peter.hicks@opentraintimes.com.

I’m celebrating this evening with a curry.

Update – Jonathan’s presentation is also available

Google Maps' Data Quality

Harry Wood pointed out that Google Maps has removed Camden Town tube station from its map. Whilst I doubt Google have done this intentionally, it has set me thinking about data quality.
When developing TransportHacker (which isn’t live yet, there aren’t enough hours in the day!), I noticed the M25 was named “Autoroute Britannique M25”. It’s been corrected now, but how on earth did that one slip by?
More data quality issues (which may have been fixed by the time you read this):

Upper Holloway station has three icons – the Underground roundel, the Overground roundel, and the National Rail symbol. Click the Underground/Overground (Wombling Free?) icon, and you see it’s actually from the bus stop outside the station
Hop down to Highbury Corner, and you can see that Highbury and Islington station has the Underground and Overground roundels, but no National Rail symbol. Click on the roundels, and you’ll see that – yes – National Rail trains do serve the station
Examine, if you will, The Famous Cock. On Google Maps, it’s between Starbucks and Flight Centre. Google Streetview shows no Famous Cock there – in fact, it’s right next to Highbury and Islington station
Finally, what is White Stadt? I think it should be White City…

Here lies the danger with processing large sets of data – do you know they’re correct?

Speeding up Ruby on Rails' ActiveRecord INSERT rate

A project that I’m working on (OK, it’s TSDB Explorer) generates a metric shedload of database rows. For a record that says “This train runs between 01-01-2011 and 31-05-2011 on Mondays to Fridays”, the code generates a timetable for each day. It takes an age to import, and I hope it’s going to be fantastically quick at querying data.
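To give a feel for the expansion described above, here is a minimal sketch in plain Ruby. It is not the TSDB Explorer code: the `running_dates` helper and the seven-character days-run string (Monday first, '1' meaning the train runs) are assumptions made purely for illustration.

```ruby
require 'date'

# Hypothetical helper: expand a validity period plus a days-run mask
# into the individual dates on which the train runs.
# days_run is assumed to be a seven-character string, Monday first,
# where '1' means "runs on this day" and '0' means "does not run".
def running_dates(runs_from, runs_to, days_run)
  (runs_from..runs_to).select do |date|
    # Date#cwday returns 1 for Monday through 7 for Sunday
    days_run[date.cwday - 1] == '1'
  end
end

# The example record from the text: 01-01-2011 to 31-05-2011, Mondays to Fridays
dates = running_dates(Date.new(2011, 1, 1), Date.new(2011, 5, 31), '1111100')

puts "#{dates.length} running days, from #{dates.first} to #{dates.last}"
```

Each of those dates then needs its own set of rows, which is where the volume comes from: one schedule record multiplies into a row set per running day.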
There’s a big downside with ActiveRecord out-of-the-box – it takes a long time to INSERT a record into a MySQL database. I left some INSERTs going at 9.50am, and they’d just about finished when I got back from the gym five hours later. 1.2 million rows in five hours is shockingly poor.
activerecord-import appears to solve the problem with the least impact. To group up your INSERTs, you create new instances of a model object – say, Association. You push these into an array, and then use the new ‘import’ method to do a mass INSERT.
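The grouping step can be sketched in plain Ruby. This is only an illustration of the batching idea, not the code from TSDB Explorer: the `in_batches` helper, the batch size of 1000 and the stand-in rows are all assumptions. Splitting into fixed-size batches is my own addition to keep the array from growing without limit.

```ruby
# Hypothetical helper: hand rows to the block in fixed-size batches,
# so each batch can be written with a single mass INSERT.
def in_batches(rows, batch_size)
  rows.each_slice(batch_size) { |batch| yield batch }
end

# Stand-in data; in a Rails app these would be unsaved model instances,
# e.g. Association.new(...)
rows = (1..2500).map { |n| { id: n } }

batch_sizes = []
in_batches(rows, 1000) do |batch|
  # With activerecord-import loaded, this is where the mass INSERT
  # would go, along the lines of: Association.import(batch)
  batch_sizes << batch.length
end

puts batch_sizes.inspect
```

The point is that the database sees a handful of large INSERTs rather than one round trip per row.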
I am quite happy at 62 minutes to insert 1.2 million rows, including processing, considering it’s an activity that only needs to be done twice, maybe three times a year.

National Fail Enquiries

Whilst I wholeheartedly support National Rail Enquiries’ aggregation of live train running data and disruption information, sometimes it can be wholly inaccurate and present a misleading picture.

Suppose I am travelling from Highbury and Islington to Shoreditch High Street today. I know these stations are on the same line, so I visit the Live Departure Boards site. I am presented with a warning saying there are no train services from this station on Sunday 3rd April.

What? But there’s a list of trains to West Croydon and Crystal Palace that all stop at Shoreditch. I visit the link in the warning and find that, actually, there are no trains between Stratford and Acton Central. The map linked to is very helpful actually, and it shows the route with the disrupted section in red. But what’s missing? The link from Dalston Junction to Highbury and Islington. So, do I need to go to Dalston Junction to take my train now?

The answer is actually quite straightforward – the website is wrong, and I know this because I’ve looked up the departures from Shoreditch High Street and seen that they’ve all departed Highbury and Islington.

What on earth is Joe Public going to do when presented with conflicting and incorrect information? It’s no wonder a number of people I know get aggravated at the quality of disruption information.

TransportHacker and DATEX II

I’ve spent a couple of weeks wrestling with Nokogiri to parse a tonne of DATEX II data into a usable format. Previously I had a mash of libxml and REXML, and the code was either ‘fast’ or ‘pretty’, but not both.
Nokogiri is good – it’s very good, in my opinion. The only trouble is, documentation and examples are a little thin on the ground, which slows everybody down. Here’s the dilemma – do I spend time writing poor documentation based on my limited understanding of part of Nokogiri, or leave it to somebody else?

Double-Sided Printing

Having run out of blank A4 paper and needing to print something, I decided “Hey, I’ve got 24 sheets, I’ll print this 40 page document double-sided!”
Why is it so difficult for me to get my head around how to do this? There are many ways I could screw up – pages back to front (printing odd and even pages on the same side), pages upside down (printing odd pages in one direction, and even upside down on the back), printing the odd pages in order but the even pages in reverse order (the first sheet having page 1 on the front, and page 39 on the back), or offsetting the pages by one sheet.

I only misprinted four pages. I call this a success 🙂