A tiny subset of your users can’t login: they get no error message yet have both cookies and JavaScript enabled. They’ve phoned up to report the problem and aren’t capable of getting a Fiddler trace. You’re serving a million hits a day. How do you trace their requests and determine the problem without drowning in logs?

Marketing have requested that the new site section your team has built goes live at the same time as a radio campaign kicks off. This needs to happen simultaneously across all 40 front-end web servers, and you don’t want to break your regular deployment cadence while the campaign gets perpetually delayed. How do you do it?

Users are experiencing 500 errors for a few underlying reasons, some with workarounds and some without. The customer service call centre need to be able to rapidly triage incoming calls and provide the appropriate workaround where possible, without displaying sensitive exception detail to end users or requiring synchronous logging. At the same time, your team needs to prioritize which bugs to fix first. What’s the right balance of logging, error numbers and correlations ids?

These are all real scenarios that Tatham Oddie and his fellow consultants have solved on large scale, public websites. The lessons though are applicable to websites of all sizes and audiences.

The talk will be divided between each of the three scenarios described here, and then stuffed full of other tidbits of information throughout.

Loading more stuff…

Hmm…it looks like things are taking a while to load. Try again?

Loading videos…