Warning signs for TSB's IT meltdown were clear a year ago, according to insider

cleansy · on April 28, 2018

Yeah, these things don't "just happen". I cannot imagine fucking up that badly, even in a relatively low-stakes environment (let's say a photo sharing app), without getting a heart attack.

I am in Spain right now and I have to get cash from an ATM soon; it feels like Russian roulette to do so. I had to use a money transfer service within the UK to pay a dentist bill, because TSB's online banking website was showing an outdated phone number of mine that is used to verify transactions for new recipients. And of course, when you change the number you cannot use the new number for 2 days because of "security concerns".

IBM's involvement in the case doesn't fill me with confidence either.

I really hope this disaster is finding its way into MBA courses as an example of why you need a sane migration path, "no matter" the costs.

EDIT: removed the request for recommendations for new banks, should be a different thread.

ethbro · on April 28, 2018

Presumably they plunked down the cash to hire an IBM A-level team, not the turn-the-crank level most people gripe about.

Already__Taken · on April 28, 2018

This is funny because in the UK an A-level team would be a bunch of 19-year-olds thinking about going to university.

So I'm still not sure what you intend to say about the quality of IBM.

DC-3 · on April 28, 2018

s/19/18 and 17

shakna · on April 28, 2018

IBM couldn't run Australia's census despite a sizeable budget, and they certainly screwed up Canada's payroll project.

They don't seem to be able to manage any large project, even with an 'A-Team'.

deadcat · on April 30, 2018

They also fucked the Queensland Health payroll system.... 1 billion over budget.

shakna · on May 1, 2018

Which thankfully got IBM banned from Queensland government use. [0]

[0] https://www.itnews.com.au/news/queenslands-ibm-ban-lives-on-...

thisisit · on April 28, 2018

I worked on a bank migration project, early on in my career, and these things are a nightmare all around.

First, building something like this requires an acute understanding of banking software. And banking software means something written in COBOL, RPG400 etc. These languages are pretty old and hence, finding talent for these is like trying to find a needle in a haystack. So, most of the stuff has to be done via brute force, trial and error. So most of the analysis provided by expert and "senior" business analysts is just that, analysis. Engineers have to bang their heads against the wall to glue stuff together.

Secondly, everyone has to be on the same page. The idea has to be that customers are the priority and egos are not.

So stuff like this shouldn't happen at all:

> To make matters worse, the Sabadell development team did not have full control – and therefore a full understanding – of the system they were trying to migrate customer data and systems from because Lloyds Banking Group was still the supplier.

In our case, the team managing the original product made it difficult for us to merge customer data. They would frequently seed the data incorrectly and refused to provide a proper data dictionary. They demanded training on the new product and that all data transformation should be done by them.

Needless to say, after spending 3 years and hundreds of millions, the project was scrapped. The migration was never completed and both banks remained on their respective systems, kicking the can down the road.

arthurfm · on April 28, 2018

> banking software means something written in COBOL, RPG400 etc. These languages are pretty old and hence, finding talent for these is like trying to find needle in a haystack.

It's really interesting to see which of the new UK and European 'challenger' banks purchased third-party banking software and which decided to spend the time writing their own modern banking systems [1]. A few examples from the linked article...

Monzo: For banking ops, it decided to build its own platform. Technology used is mainly open source: Linux, Apache Cassandra distributed database (used by the likes of Apple and Twitter), Google’s Go (golang) programming language at the back-end and PostgreSQL relational database. The system is hosted at two data centres in the UK on Mondo’s own hardware. There is a team of 16 people working on this.

Atom Bank: It has created a hefty technology set-up in its run up to the launch: FIS’s Profile core banking system; FIS/Sungard’s Ambit Quantum and Ambit Focus for treasury and risk management; Iress’ Mortgage Sales & Origination (MSO) suite for mortgage business, front-to-back office; Wolters Kluwer’s OneSumX for regulatory reporting; Intelligent Environments (IE) for front office capabilities; CSC’s ConfidentID system for security; Phoebus Software for secured business lending and account servicing for residential lending; and WDS Virtual Agent for customer queries supplied by WDS (a subsidiary of Xerox).

Starling: The bank has an in-house developed core system. It uses GPS and Bottomline Technologies for processing and payments operations, respectively.

At what point does it become cheaper to buy one of these smaller banks and migrate everyone across to their platform?

[1] https://www.bankingtech.com/2018/04/uk-challenger-banks-whos...

obeattie · on April 28, 2018

I’m Monzo’s Head of Engineering. A lot of the technologies mentioned above for us are accurate but some aren’t in use - notably PostgreSQL - and some others which are crucial are missing - eg. Kubernetes. The vast majority of the platform runs on AWS. Our engineering team has grown to around 60 now.

Our entire backend has been built in-house using modern tech. You can read a bit more about it on our blog if you’re interested: https://monzo.com/blog/2016/09/19/building-a-modern-bank-bac...

collyw · on April 28, 2018

Modern doesn't necessarily mean good. AWS hasn't been without its share of problems. "Tried and tested" is something I would be looking for with banking software.

Saying that, TSB's stuff has just gone tits up and, as far as I am aware, they aren't using anything too trendy...

dankohn1 · on April 28, 2018

And you can see Oliver describe his experiences at Monzo next week at KubeCon Copenhagen.

https://kccnceu18.sched.com/event/Dsan/keynote-anatomy-of-a-...

Stranger43 · on April 28, 2018

The real answer here is that it's probably never going to be cheaper to buy a bank with a working IT platform, as merging terabytes' worth of actively changing data from an underdocumented 1960 noSQL database into a new greenfield product is the hard problem almost nobody has succeeded at solving without massive problems and unexpected delays and costs.

In reality, what happens when a bank buys another bank is that yet another middleware layer is introduced so that frontend systems can access data from both of the old systems, not that one system is discarded while the other survives.

shakna · on April 28, 2018

> underdocumented 1960 noSQL database

Depends on what you mean by 'underdocumented'.

The specsheets for the Fortran part of the stack at CommBank were 4000+ pages last I saw them. The COBOL side was 9000+.

Then there were a couple million test cases for both, and a running system with no real users that got tested before production.

tim333 · on April 29, 2018

You'd think they could copy the limited data people actually need (account balance, direct debits, mortgage status and payments and the like) over to the new system and archive all the other stuff?

collyw · on April 28, 2018

The majority of banking software does work and is well tested. As opposed to new stuff which will likely have some bugs to be ironed out.

yomly · on April 28, 2018

Maybe it is part of crafting an article but did anyone else find the contrast of the looming tech disaster and team champagning really unfair?

The issue was obviously a systemic one, and that 18-month slog would have been a horrific death march for the people actually working on the project, so why shouldn't they be able to celebrate it being "over"?

Granted, it was a failure, but I'm not really sure what the floor staff - the actual "software engineers" who were pictured had to do with it when they were set an impossible goal to begin with...

forapurpose · on April 28, 2018

> why shouldn't they be able to celebrate it being "over"?

Because it's not nearly over. In my experience, going into production is the most worrisome phase. For one thing, no matter how much you test and prepare, you just never know what's going to happen. For another, there are the initial bugs and user confusion that come with any new deployment, and the rush to respond to them effectively and quickly in order to maintain user confidence in the new system. And in the bank's situation we're not talking about an update, but an entirely new thing made from whole cloth (or code), if I understand correctly.

The outcome of the project, the goal, isn't to press the big red 'release' button, it's a stable, functioning system that meets the project specifications. When you have that, then you are done and it's time to celebrate. And exhale.

yomly · on April 28, 2018

You're right - I fully agree with you, but I actually think that's kind of an entirely separate issue.

If it took your team an 18-month death march just to get to the "red button" stage, I have a hard time imagining many people who wouldn't toss it all in if, instead of getting some kind of company recognition, they were told "lol JK welcome to level 2 - hope you're ready to work harder now".

I'm not sure even the videogames industry could get away with not pandering to morale that badly without mass exodus.

svcop3 · on April 28, 2018

Nope. Don't preen like a peacock all over social media until the job is really done. I certainly wouldn't.

Xuper · on April 28, 2018

I never feel like celebrating the releases that were shipped even solely by me. Maybe once I build something with Idris or Ada?

sanderjd · on April 28, 2018

Nah, I think this is just a personality thing. Coworkers often try to get me hyped up about finishing some release and I've learned that it's just not how I respond. My "celebration" is just to be relieved, go home and tell my wife about it, and start thinking about the next thing.

shakna · on April 28, 2018

Taking a moment to enjoy a success, or just that a major heap of work is over, is critical.

The industry has a major problem with burnout. If we don't slow down, and take a break, then eventually, burnout becomes inevitable. The mind gets overworked.

Some people celebrate to recharge. Others stop using a keyboard, and find something else to do.

I don't recharge around others, I find it exhausting. But after a major project ends, I do usually find myself buying the new hit PC game, or taking a hike into the mountains.

Your coworkers celebrate, so that they can feel the weight of the release lift off easier.

Something else might already be playing that role with you - and some releases will be easy for you, and several months of hellish stress for someone else.

sanderjd · on April 29, 2018

I think you missed my point (I probably didn't make it well): my point was just that "celebration" in the typical sense, is not how I take a moment to enjoy a success.

heavenlyblue · on April 28, 2018

People need milestones. It keeps them going rather than getting depressed about the fact that it all goes nowhere (in their opinion).

ams6110 · on April 28, 2018

Agreed. Same reason I skip office Christmas parties, "team building" outings, retreat weekends, etc.

snarfy · on April 28, 2018

Fixed staff size. Fixed deadline. Fixed feature set. Something was going to give. You can't have all three.

tootie · on April 28, 2018

If you fix scope and timeline, then the thing that has to give is quality. It doesn't seem like they failed to deliver all the parts of their system, they just didn't all work correctly.

lowken10 · on April 28, 2018

I worked for Lloyds TSB around 2003/2004. In banking the domain knowledge (banking & finance) is more valuable than the technology knowledge. There was a guy at Lloyds who couldn’t write a line of code, but he knew every field and every column and what that field meant and why it was there.

This guy was as close to unfireable (is that a word?) as it gets.

danburbridge · on April 28, 2018

That sounds familiar. I was there too around 2003-2005. Very little of what has been said surprised me, and it rang a lot of bells from my time there. Hugely siloed and with very strict hierarchies.

sizzle · on April 28, 2018

This is why I'm leaving fintech for good. Is healthcare worse??

tim333 · on April 29, 2018

Having read a bit, yeah it seems worse. At least banking is about numbers you can put in a spreadsheet, health less so.

tonyedgecombe · on May 1, 2018

You would think domain knowledge extended to not using floating point values for a balance, I just downloaded a CSV statement and it's full of entries that look like 1234.560000000003.
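A minimal Python sketch of how values like that typically arise, assuming the export sums binary floating-point amounts and writes them out unrounded. The amounts are made up for illustration and are not taken from a TSB statement; a decimal type is shown alongside for comparison.

```python
from decimal import Decimal

# Hypothetical transaction amounts in pounds, for illustration only.
amounts = ["0.10", "0.20", "0.30"]

# Summing as binary floats: 0.1, 0.2 and 0.3 have no exact binary
# representation, so the rounding error shows up in the total.
float_total = sum(float(a) for a in amounts)

# Summing as decimals: each amount is represented exactly.
decimal_total = sum(Decimal(a) for a in amounts)

print(repr(float_total))  # not exactly 0.6
print(decimal_total)      # 0.60
```

Writing `float_total` straight into a CSV gives the kind of trailing digits described above; the decimal total keeps the two places it was given.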

tonyedgecombe · on April 28, 2018

They announced this week they are going to bring IBM in to help resolve the problems, somehow that doesn't fill me with confidence.

ams6110 · on April 28, 2018

The team at TSB must be utterly incompetent, if the alternative of bringing in a new team, with absolutely no knowledge of the systems or what has been done in the last 18 months, is thought to have a better chance at working. That's the message I get.

More likely, IBM will install a standard COTS banking software platform, migrate data where they can, and declare success. If account balances match, that will be enough. Desired functionality, either for internal users or customers, will be secondary.

And that's probably what TSB should have done from the beginning, minus the IBM involvement. That hasn't worked well historically, based on the number of lawsuits against them for failing to deliver contracted systems.
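For a sense of what an "account balances match" check might look like after a migration, here is a rough Python sketch. Everything in it is hypothetical: the `reconcile_balances` helper, the account IDs and the balances are invented for illustration and say nothing about how TSB, Sabadell or IBM actually verify a migration.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

Balances = Dict[str, Decimal]


def reconcile_balances(
    legacy: Balances, migrated: Balances
) -> Dict[str, Tuple[Optional[Decimal], Optional[Decimal]]]:
    """Return accounts whose balance is missing or differs between systems."""
    mismatches = {}
    # Walk the union of account IDs so accounts missing on either side show up.
    for account_id in sorted(legacy.keys() | migrated.keys()):
        old = legacy.get(account_id)
        new = migrated.get(account_id)
        if old != new:
            mismatches[account_id] = (old, new)
    return mismatches


# Hypothetical data: one transposed balance and one account lost in migration.
legacy = {"A1": Decimal("100.00"), "A2": Decimal("38.60"), "A3": Decimal("5.00")}
migrated = {"A1": Decimal("100.00"), "A2": Decimal("38.06")}

for account_id, (old, new) in reconcile_balances(legacy, migrated).items():
    print(account_id, old, new)
```

A check like this only covers balances. It says nothing about direct debits, payees or any of the other functionality the comment expects to be treated as secondary.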

9935c101ab17a66 · on April 29, 2018

> The bank’s boss, Paul Pester, said TSB will waive £10m in overdraft fees and pay extra interest on current accounts. He has hired a new team of IT experts from IBM who have been told the problems must be fixed by Saturday.

(a quote from a different article).

This guy is an absolute joke. Just because you want something fixed quickly doesn't mean it's going to happen. Bringing in an outside team is already a REALLY bad sign, but demanding an outside team to get up to speed and to implement a complete fix in two days? Yikes.

wiredfool · on April 28, 2018

Now they have N+1 problems.

staticfish · on April 28, 2018

But now a third-party scapegoat which would take the heat off of TSB execs.

votepaunchy · on April 28, 2018

This could even be 2*N.

BillinghamJ · on April 28, 2018

noir_lord · on April 28, 2018

(N^N)!

S_A_P · on April 28, 2018

Systems integration is one of those problems that is new every time because every environment is different. Sure, patterns arise and sometimes they can be cut/pasted between organizations, but most of the time there is just enough difference to make it a huge risk. I have written at least a dozen integrations between SAP and commodity trading platforms and the mantra DRY doesn’t really apply. You start over from scratch each time. Sure, I know more about the quirks of the various systems, but just because it worked at the last place doesn’t mean it works now.

All of these systems are moving targets as well at various version/patch levels so the best way to estimate projects of this nature is to take a conservative estimate and double it. Then add 50%.

lordnacho · on April 28, 2018

> When TSB split from Lloyds Banking Group (LBG), a move forced by the EU as a condition of its taxpayer bailout in 2008, a clone of the original group’s computer system was created and rented to TSB for £100m a year.

> That banking system was a “bodge of many old systems for TSB, BOS, Halifax, Cheltenham and Gloucester and others” that had resulted from the “nightmare” integration of HBOS with Lloyds as a result of the banking crisis, according to one insider who had extensive access to and intimate knowledge of LBG and TSB’s internal systems over a prolonged period.

That sounds completely crazy. If you've got £100m a year in IT budget, why on earth would you buy a clone of Frankenstein?

You could hire a fine team of devs to build you a modern system. Then again, I'm not the kind of guy who believes in "never rewrite" which seems to be the advice.

> On Thursday he admitted the bank was on its knees, announced that he was personally seizing control of the attempts to fix the problem from his Spanish masters, and had hired a team from IBM to do the job.

This doesn't give me much confidence, either. Hiring outside help is a Coase problem. You're going to find frictions dealing with the externals. And it will cost you, I'm guessing, at least £100M a year.

With that kind of budget, and with IT being more or less all a retail bank does, you should hire hundreds of experienced staff, make them integral to the business, and let them solve the issues as they appear to the business units. When things happen they will have an idea of what the priorities are. There are plenty of software people who understand how banking works, and what systems are needed. Go and hire them.

Stranger43 · on April 28, 2018

And shut down your entire operation while said team worked?

The problem is that the average bank's system is a kind of Frankenstein tree that has grown inside and around every policy, procedure and task the bank performs without any coherent design and with several dozen loosely coupled components, each with its own poor and fragmented documentation.

And while it's typically only required to remain up 16-18 hours a day, you have zero allowance for unplanned downtime and a fairly high peak load, which along with the age and complexity of some of the components makes the entire system a nightmare to run on a tight budget.

And it's worth noting that the system that failed was not the Frankenstein system they inherited from Lloyds but the new one they tried to import from Spain.

ethbro · on April 28, 2018

This. Banks can't have extended downtime.

So if a $100M budget would get a new system built, then what they would need is a $200M budget ($100M to run legacy in prod + $100M to build new system and gradually migrate).

The only good in-house systems I've ever seen were (a) based on vendor reference designs w/ minimal changes, (b) based on OSS, (c) architected by some very smart people who stay at the company for non-monetary reasons.

Because when you boil it down to it, no company is big enough to solve a problem better than a group working with multiple companies (unless the problem is trivial).

Or to word it another way, are you bigger / better / smarter than both of your top competitors put together? If no, then don't reinvent the wheel.

Khaine · on April 29, 2018

Banks can't have any downtime. They need to be able to process EFTPOS and Credit Card Transactions 24x7. Now, not all systems need to be always available in the bank, but the key ones do.

walshemj · on April 28, 2018

Which is based on a COBOL system from Accenture, from what I hear.

Chyzwar · on April 28, 2018

I worked in two big banks.

It is not easy. Many regulations and existing legacy systems result in systems built around FTP and flat files. The business side also does not have a clue; sometimes it is very difficult to tell why someone made a decision 10 years ago, since nothing is documented.

The only hope is to replace systems one by one and make gradual migrations. Problem with this approach is that you need to add new interfaces to existing legacy systems and add some crazy stuff to the new system.

The above would be possible if you could hire top talent and retain it for many years. With the current IT market, it is not possible. Even Google and Facebook, which pay way above market, struggle to retain people for more than two years. Additionally, the business domain is boring and the technology is outdated.

jstanley · on April 28, 2018

> Even Google and Facebook that pay way above market struggle to retain people for more than two years

Perhaps they're not paying "way above market" then. If buyers paying "way above market" struggle to attract sellers, it means the buyers are wrong about what the market price is.

user5994461 · on April 28, 2018

People find a partner, have children, move to follow their partners or move back to take care of their parents. Projects are completed, roles shift, company changes.

A higher salary doesn't have any effect on any of the above.

devonkim · on April 28, 2018

Attracting is one thing, retention is another. Pay isn’t the reason people leave from FANGS oftentimes, but pay is oftentimes the reason people move to FANGS.

Chyzwar · on April 28, 2018

  http://uk.businessinsider.com/employee-retention-rate-top-tech-companies-2017-8
  http://uk.businessinsider.com/comparably-50-best-paying-big-companies-salary-employees-2017-11

Talented people want to make an impact. Working for Google building proto-to-proto services is the opposite of making an impact. On average your work will have very little spotlight but you will make lots of money for the company.

Finnucane · on April 28, 2018

If you were a jazz musician in the 1920s, the best-paying stable job you could get was in Paul Whiteman's band. So Whiteman was able to attract some of the best players of his day (well, the best white players of his day, anyway). But the work frequently meant playing boring arrangements for upper-class toffs. There was a fair bit of turnover in the staff despite the good pay. Some of his players just drank themselves into an early grave.

roel_v · on April 28, 2018

You can't just say 'hey guys we'll build you a new system, see you in two years'. It's about the migration path, it always is, building new systems is easy.

collyw · on April 28, 2018

That's the difficult part. Then you have an untested (in the wild) system.

user5994461 · on April 28, 2018

>>> That sounds completely crazy. If you've got £100m a year in IT budget, why on earth would you buy a clone of Frankenstein?

It's a fairly low budget for a big company. That's a few hundred employees at an average cost of $100k, then the rest goes to hardware, suppliers, support contracts and other costs.

pebers · on April 28, 2018

That was the system they already had, spinning out a clone of the old one they were using was presumably seen as the easiest way to separate the two banks. And for all the negative press it got in that article, it did at least work...

marvin · on April 28, 2018

Banking is particular in the sense that there is often a large number of legacy systems that need to communicate reliably, at the same time that major business/technology decisions are made by leaders who do not have technical understanding. (Even in my native Norway, lauded for good digital banking services, almost no banks have a single technologist in their executive team -- the culture is changing, but most places still consider technology a service that is purchased and mostly separate from business concerns).

In a good case, top leadership will listen to architects and leads on the tech teams before making critical decisions, but in some cases, "business goals" will trump concerns from the technologists. This is of course a huge failure of communication, but it is an even bigger failure of organization. If technologists have to threaten to quit just to get their point across, the organization is broken.

Let's say I'm a senior banking developer in Europe. I'm being paid a fixed, moderate salary with a fixed number of hours each week, and I get no part of the bonus if this €100 million initiative succeeds, and I suffer no loss if we incur another €100 million of extra costs due to this failure. If I quit, it's with 3 months' notice and it's a big PITA to find new work that suits my interests. What incentive do I have to do anything but do my best to alert the leadership to these problems, and then do my best to move the train of failure along?

The exact scenario described in this story -- an expensive service contract terminated at a hard date, with costs to re-instate this contract therefore becoming even bigger, and the development team being pressed to deliver on this deadline whether it is realistic or not -- does not surprise me at all, and must have happened dozens of times all over the world.

If it goes wrong on a small scale, you will only see a few hundred or thousands of customers affected in a non-catastrophic manner (e.g. see the wrong balance in their accounts, but with the correct number being accessible in the back-end system), but it stands to reason that this example would at some point happen at a spectacular scale with no easy way back.

I'm not holding my breath, but at some point the boards of these banks should realize that technology is a core competency, and get people with tech skills in a position to make critical top-level decisions. (Not to mention get pay to have some semblance of connection with the sums involved in the success or failure of the work -- I get the impression that this is the case elsewhere in finance).

timthorn · on April 28, 2018

> at some point the boards of these banks should realize that technology is a core competency

The regulator already has. Some years ago they fined NatWest after an IT issue because the management processes and technology risk management were not robust; I'd expect that they'll take a very hard look at this case.

walshemj · on April 28, 2018

There is a recent job ad from 3 days ago for a CIO on LinkedIn: https://www.linkedin.com/jobs/view/633114556/

Macha · on April 28, 2018

More likely a troll? I can't believe a major bank is hiring a C-level on LinkedIn

sulam · on April 28, 2018

I think it’s a miscommunication. That looks like a senior role in their CIO ‘group’, not specifically the CIO.

bklaasen · on April 28, 2018

There was a similar catastrophic failure at RBS back in 2012, which took a month to recover from: https://en.m.wikipedia.org/wiki/2012_RBS_Group_computer_syst...

Tsiklon · on April 28, 2018

I remember that quite painfully; I was an Ulster Bank customer back then. Fortunately, I was a broke student with not much going in or out of my account.

wiredfool · on April 28, 2018

"""“This turned what was a super-hard systems job [into] a clusterfuck in the making,” the insider said"""

Oh dear.

Later:

"""The bank has been forced to cancel all overdraft fees for April and raise the interest rate it pays on its classic current account in a bid to stop disillusioned customers taking their business elsewhere."""

I think the main reason their customers aren't taking their business elsewhere is that their money is stuck till this is resolved.

tialaramex · on April 28, 2018

The UK's banks are all obliged to offer a system to consumers (and small businesses now too) where - manually if necessary - the bank transfers that customer's current account to another bank in a specific time period (one week? 10 working days? I don't remember). This is a result of government investigators concluding that banks weren't actually facing much competition because their customers thought switching to a competitor would be really hard and so they didn't bother.

Obviously for this to be economical normally, the banks must automate the shit out of the problem. That means not just transferring the correct balance but identifying regular payments, notifying payees like employers, and sorting basically everything out. Anything they don't automate ends up as yet more work for their customer services agents, because if it goes wrong they have to pay to fix it. So for TSB right now this is yet another cost they're soaking, and they don't even get to keep the customers, those customers are gone, no take-backs.

the_mitsuhiko · on April 28, 2018

Is there anyone with a TSB account that is not encountering issues at the moment? From what one can read online it sounds proper dire for the bank.

nekitamo · on April 28, 2018

I’m at a major UK university and many acquaintances use TSB. Not a single one of them can log into their account. It has been this way for around a week, as far as I can tell.

Kusnier · on April 28, 2018

I have an account, tried logging in the other day just to see how screwed it was and actually had no issues at all so I guess not everyone is affected (I tried on like Thursday so after the worst of it was done I believe)

merlish · on April 28, 2018

I managed to log in, but the system is pretty barebones.

Trying to change or apply for a new banking product just takes you to a help page saying the ability to do this is 'Coming soon'. (Some other features are scheduled to be available by the 'End of April', for comparison.)

Also, in the 'pending transactions' popdown, e.g. £38.60 is displayed as '38.6'...

peoplewindow · on April 28, 2018

That implies the balances are being represented as floats and turned directly into strings ... how does something that basic happen at all? Is Sabadell's Spanish web UI like that too? No wonder they're screwed

jacques_chester · on April 28, 2018

> That implies the balances are being represented as floats ... how does something that basic happen at all?

JavaScript only has floats. For the unwary this is a common source of bugs in frontends to financial systems.

Hopefully the backend is using some kind of decimal type.
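One common way around this is to carry amounts as an integer number of minor units (pence) and only format them at the edge. A rough sketch, written in Python for brevity rather than JavaScript; `format_pounds` is a hypothetical helper, and nothing here reflects TSB's or Sabadell's actual code.

```python
def format_pounds(pence: int) -> str:
    """Format an integer amount of pence as pounds with exactly two decimals."""
    sign = "-" if pence < 0 else ""
    pounds, remainder = divmod(abs(pence), 100)
    return f"{sign}£{pounds}.{remainder:02d}"


# A float turned directly into a string drops the trailing zero.
print(str(38.6))            # 38.6

# Integer pence formatted explicitly keeps both decimal places.
print(format_pounds(3860))  # £38.60
```

The integer never passes through a binary fraction, so there is nothing to round, and the display format is decided in one place.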

tonyedgecombe · on April 28, 2018

I can log in now and see my transactions; some parts are still broken though, like downloading statements.

valdiorn · on April 28, 2018

I got a letter from TSB earlier this week stating that they have to send me paper statements this month, they are unable to provide paperless statements for the time being.

That just doesn't make sense.

arm85 · on April 28, 2018

Yeah, it's working for me, since about Wednesday/Thursday.

collyw · on April 28, 2018

Any insiders at other banks want to comment on the state of their systems? I would like to know if my banks are at risk.

merinowool · on April 28, 2018

I wonder why exactly this has failed. It feels like, when using good practices (especially TDD), this shouldn't have happened. Also, wanting to do a "big bang release" is a recipe for failure.