Speeding up Ruby on Rails’ ActiveRecord INSERT rate

A project that I’m working on (OK, it’s TSDBExplorer) generates a metric shedload of database rows. For a record that says “This train runs between 01-01-2011 and 31-05-2011 on Mondays – Fridays”, the code generates a timetable for each day. It takes an age to import, and I hope it’s going to be fantastically quick at querying data.

There’s a big downside with ActiveRecord out-of-the-box – it takes a long time to INSERT a record in to a MySQL database. I left some INSERTs going at 9.50am, and they’d just about finished when I got back from the gym five hours later. 1.2 million rows in five hours is shockingly poor.

activerecord-import appears to solve the problem in the least impact way. To group up your INSERTs, you create new instances of a model object – say, Association. You push these in to an array, and then use the new ‘import’ method to do a mass INSERT.

I am quite happy at 62 minutes to insert 1.2 million rows, including processing, considering it’s an activity that only needs to be done twice, maybe three times a year.

National Fail Enquiries

Whilst I wholeheartedly support National Rail Enquiries’ aggregation of live train running data and disruption information, sometimes it can be wholly inaccurate and present a misleading picture.

Suppose I am travelling from Highbury and Islington to Shoreditch High Street today. I know these stations are on the same line, so I visit the Live Departure Boards site. I am presented with a warning saying there are no train services from this station on Sunday 3rd April.

What? But there’s a list of trains to West Croydon and Crystal Palace that all stop at Shoreditch. I visit the link in the warning and find that, actually, there are no trains between Stratford and Acton Central. The map linked to is very helpful actually, and it shows the route with the disrupted section in red. But what’s missing? The link from Dalston Junction to Highbury and Islington. So, do I need to go to Dalston Junction to take my train now?

The answer is actually quite straightforward – the website is wrong, and I know this because I’ve looked up the departures from Shoreditch High Street and seen that they’ve all departed Highbury and Islington.

What on earth is Joe Public going to do when presented with conflicting and incorrect information? It’s no wonder a number of people I know get aggravated at the quality of disruption information.

TransportHacker and DATEX II

I’ve spent a couple of weeks wrestling with Nokogiri to parse a tonne of DATEX II data in to some usable format. Previously I had a mash of libxml and REXML, and the code was either ‘fast’ or ‘pretty’, but not both.

Nokogiri is good – it’s very good, in my opinion. The only trouble is, documentation and examples are a little thin on the ground, which slows everybody down. Here’s the dilemma – do I spend time writing poor documentation based on my limited understanding of part of Nokogiri, or leave it to somebody else?