Stoned and Wired: Behind Discreet’s Y2k+4 bug

“It was clear from the look on their faces that they weren’t joking… we knew we had a global problem, that’s when we hit the panic button.” Just three hours after a major support call from Paris, and eight and a half years after an MIT graduate in Boston decided to generate a unique ID for any frame in any Discreet system using a 28-bit number, Bill Roberts (Director of Product Management – Systems) realized that Discreet had a serious problem.

Bill Roberts’ flight had been delayed out of New York. This, coupled with the largest infrastructure problem in Discreet’s history, meant that 48 hours, 1700 phone calls and 4000 system updates later, Roberts was short three nights of sleep and a little tired as he pulled off the highway into a Homebase store car park to answer his cell phone. Fxguide was calling him from halfway around the world, exactly one hour before every major Australian east coast effects suite would have started shutting down, had Roberts and the other 50-odd members of the ad hoc emergency response team not just completed system upgrades to 97% of the world’s Discreet edit and effects suites.

“Like everyone, I thought it was a joke at first, an urban myth, but sure enough when we saw the code, it was clear. We had a problem,” Roberts recalls. It started in Paris, where two people had a weird problem. Their system clock was not correct, but other than that their suite should have been working, yet the Stones wouldn’t mount.

error message:
System time 1096830485 is greater than SW epoch time limit 1096770256

It was discovered that if the system clock was turned back the suite came back to life and there was no loss of data – everything just seemed fine. As with any major company, the support team have a process of fault escalation. The fault moved up the chain of response to Canada, where the engineer responsible for the logged support call started looking through the code to find the relevant error message generator.
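The behaviour described here (a refused mount once the clock passes a fixed limit, and a healthy system again once the clock is wound back) is consistent with a simple guard on the system time. The following is only a minimal sketch of such a guard: the function name `check_mount` and the constant name are hypothetical, and only the two numbers and the wording of the message come from the error quoted above.

```python
# Hypothetical sketch: names are illustrative, not Discreet's actual code.
# The limit and the message wording are taken from the quoted error message.
SW_EPOCH_TIME_LIMIT = 1096770256  # Unix seconds


def check_mount(system_time: int, limit: int = SW_EPOCH_TIME_LIMIT) -> None:
    """Refuse to mount when the system clock is past the limit."""
    if system_time > limit:
        raise RuntimeError(
            f"System time {system_time} is greater than "
            f"SW epoch time limit {limit}"
        )


# The clock value from the Paris report trips the guard...
try:
    check_mount(1096830485)
except RuntimeError as err:
    print(err)

# ...while a clock wound back to just before the limit passes silently.
check_mount(SW_EPOCH_TIME_LIMIT - 1)
```

Because a guard like this only reads the clock and touches no stored frames, it would also explain why winding the clock back cost no data.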

Then they found it. It was not exactly hidden either. Clearly documented in the number generator for the frame IDs of any Stone file system worldwide was a built-in limit. Given that the system counted in seconds, and that the original programmer who wrote the code had chosen only a 28-bit number, just over eight and a half years later the system would see that the date exceeded the maximum allowed and not allow the file system to mount. The employee has long since left the company, but this was no disgruntled employee hell bent on revenge: who would lay a time bomb that would take that long to activate? Clearly the programmer expected that the code would have long since been rewritten, fixed or discarded before 2004. And if it had been a disgruntled employee, they wouldn’t have documented the issue in the code comment fields. “Exactly,” agrees Roberts, “that’s what I said – I mean the number generator worked – we had no reason to re-write it – it isn’t as if we didn’t have other things to do!”
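The eight-and-a-half-year figure can be checked with a little arithmetic. The sketch below assumes that the limit quoted in the error message is simply the counter's starting point plus 2^28 seconds. The article does not confirm that, so the implied start date it prints is an inference, not a documented fact.

```python
from datetime import datetime, timezone

ID_BITS = 28                      # width of the counter, per the article
SW_EPOCH_TIME_LIMIT = 1096770256  # limit quoted in the error message

# How long a 28-bit counter of seconds lasts before it runs out.
span_seconds = 2 ** ID_BITS
span_years = span_seconds / (365.25 * 24 * 3600)

# Assumption: limit = counter start + 2**28 seconds.
implied_start = SW_EPOCH_TIME_LIMIT - span_seconds

print(span_seconds)          # 268435456
print(round(span_years, 2))  # 8.51
print(datetime.fromtimestamp(implied_start, tz=timezone.utc).isoformat())
# 1996-04-01T05:00:00+00:00
print(datetime.fromtimestamp(SW_EPOCH_TIME_LIMIT, tz=timezone.utc).isoformat())
# 2004-10-03T02:24:16+00:00
```

Under that assumption the counter would have started at midnight Eastern Standard Time on April 1st 1996. By the same arithmetic, an unsigned 32-bit counter of seconds would have lasted about 136 years.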

photos courtesy Rich Bobo (L), Jake Parker (R)

It was 11am when the problem first arrived on someone’s desk in the converted warehouse on Duke Street, in the heart of Montreal, that serves as Discreet’s head office. It was now 2 pm and Roberts called “everyone” to let them know what was going on. As one employee told us off the record, “the word went out – no one goes home tonight!” In reality the request was completely unnecessary: some of the team members stayed voluntarily for 37 hours straight.

Discreet is part of a publicly listed company and an issue like this is a major corporate deal. Discreet phoned Autodesk and spoke directly to CEO Carol Bartz and Chief Operating Officer Carl Bass. Autodesk immediately offered the entire resources of the parent company to help with the problem. However, the issue at this point was not something that Autodesk head office could help with.

Discreet faced three immediate problems. It was the 29th of September: they had just two days to solve this problem, they didn’t have a solution, and they needed every suite owner in the world to know about it with enough time to apply the fix, whatever it would be. Otherwise, when the world went to work on Monday morning, the world’s advertising, film and effects community would have a lot of dead suites, and Discreet could potentially face multi-million dollar lawsuits and a global community of annoyed users and clients.

Martin Vann and Marc Petit, Discreet’s executive team in charge of Sales, Marketing & Support, and Product Development, respectively, mapped out the plan for the company. Initially 50 people were put onto the problem and divided up into three teams. The first was the R&D team that needed to solve the bug. Of course, just throwing engineers at such a problem is counterproductive, and as it happened it was just 4 or 5 of Discreet’s best engineers who solved the problem. The second team was an outward bound communications team. This group was needed to inform all users and all of discreet’s direct and reseller service and support staff. The third group was documentation, since when the fix came it would clearly need ironclad instructions on how to implement it.

There was one additional problem that later surfaced. As this problem was embedded deep in the original code, and every system sold over the last eight and a half years was in trouble, discreet would need to test the fix on a very wide range of systems. It soon became apparent that while discreet had many standard configurations, and all of the newest ones, there were older systems that discreet generally no longer supports that would also need the fix. So a team was put together to cobble together an ad hoc ‘beta’ team covering every system ever supported. In some cases this meant assigning individual support staff to particular customers with unique or nearly unique old configurations.

It was in these opening hours that fxguide was contacted. Fxguide was contacted because it is considered to be a high “contact point for users, especially those who might have left flame news”, explained Roberts via his cell phone days later. “Of course, we used every list we had… from flame-news to backdraft-beta-news, to get the word out. In fact, the biggest complaint we have had to date has almost been that people were hit with this information multiple times. Sometimes by us and then again by resellers and support people.” Fxguide, like much of discreet, was informed that there was a problem but that the details, fixes and suggested courses of action were still being hammered out, and we were advised to wait, which we did, standing by for hours into the first night.

The problem did disappear by rolling back the system clock. When the system clock was rolled back there was no loss of data, but this was not a viable solution for most users. First, while it did not affect discreet’s files internally, there was no way of knowing how changing the clocks would affect other software and any servers connected to the suites. Secondly, all temp or evaluation licences, beta licences and the like issued by discreet have safeguards built in to stop exactly this practice of winding back the system clock to extend the evaluation licence period. Adjusting the clocks on these systems would completely shut down the system.

The patch was found and posted in stages. Roberts explains that “the pre 2.0 patch was a little harder than expected, and no one really knows this yet, but our fix for pre 2.0 software has only bought another eight and a half years.” In other words, while the problem is completely fixed for most systems, for the earlier un-upgraded pre 2.0 systems the solution is effectively to restart the clock, “so I have a huge diary entry to review the situation and re-check it again in seven years’ time,” joked Roberts.

Additional bandwidth was arranged for servers and internet connections for people to download the fixes. Even with the incredibly small size of the fix, thousands of users would be hitting the servers at the same time. Discreet faced similar bandwidth demands when the 3D Max community hit the servers for the newest release, so systems to handle this were already in place. “While we don’t have an (emergency plan) sealed folder we rip open, we did have disaster plans and escalation procedures in place,” points out Roberts, and “our licence contacts database was in better shape than even we thought”, but at the core of the solution would be the passion of the ad hoc team for making sure this fix was done right. To succeed, a wide group of people throughout all parts of the company, worldwide, would need to work collectively in a very short amount of time. According to Bill Roberts, “it was the ability of each person who was assigned a task to not only meet but exceed the objectives set for them that made this a ‘low-impact’ event for our customers”.

Discreet has done a second round of follow-up calls and random checks to make sure everyone has managed to install the fix correctly. While the core team has now headed off for some well-earned rest, discreet is back to business as usual. To the best of Bill Roberts’ knowledge, the only users who have lost billable hours, other than the maintenance time to install the fix, are the original two guys in Paris.

Questions remain: how did discreet’s Y2K scrub fail to find this in 1999? Do any other such landmines exist? And just why was a 28-bit number originally used, and not, say, a 32-bit number? These and other internal questions are the focus of the upcoming week for discreet’s executive team, engineers and Roberts. Until then he has to buy garden pipes and some beer. Oh… and get some sleep.

UPDATE OCT 27th:

Unconfirmed sources have just jokingly warned us that the database for the file system also has a date stamp issue: on January 19th 2038 a signed 32-bit count of seconds since January 1st 1970 overflows, and the database’s internal date stamp will revert to December 13th 1901. Feel free to use it until then 🙂
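The dates in this joke match the well-known “year 2038” rollover of a signed 32-bit count of seconds since January 1st 1970. The article does not say how the database actually stores its date stamp, so the sketch below only assumes that representation and shows where such a counter would wrap. The helper `wrap_int32` is a hypothetical name used for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2 ** 31 - 1


def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


# Last second a signed 32-bit counter can represent.
last_good = EPOCH + timedelta(seconds=INT32_MAX)

# One second later the counter wraps to its most negative value.
after_wrap = EPOCH + timedelta(seconds=wrap_int32(INT32_MAX + 1))

print(last_good.isoformat())   # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())  # 1901-12-13T20:45:52+00:00
```

The dates are built with `timedelta` rather than `datetime.fromtimestamp`, since negative timestamps are not handled consistently on every platform.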
