SpinVox: Behind the spin

Sometimes it’s hard to know what to believe. Over the years I’ve witnessed more ‘live demos’ than I care to remember (more, probably, than I do remember) and if I had to derive one abiding rule from them all, it would be this: most demos don’t work – and the ones that do, more often than not, are rigged.

This isn’t cynicism, by the way, this is reality. I’m always happy to see these things function as they’re supposed to, and I don’t revel in the failure of others (much). The pre-commercial demo of predictive text input solution T9 that I received years ago from its designer was brilliant; flawless in execution and as simple a concept as you could wish for. Soon after the demo, T9 began its surge to ubiquity.

Just as simple a concept is that touted by SpinVox. It’s a service that takes your voicemails and turns them into text messages, so you can read them and re-read them at your convenience, without having to dial into and navigate a time-consuming voicemail application. And, like T9, SpinVox works. I’ve been using SpinVox for some time and it does a decent enough job.

The difference between SpinVox and T9 is that the guy who designed T9 was happy to explain at length how his solution worked. The people at SpinVox, on the other hand, are a little cagey about the details. So when I, along with a number of other journalists, was invited to the firm’s headquarters in Marlow, Buckinghamshire, to see what it promised would be a full demonstration of its solution in use, I signed up straight away.

SpinVox was doing this in response to a wave of criticism from commentators and bloggers – some of whom have claimed to be current and past employees of the firm – that has suggested the firm’s technology claims are bogus and that it is in dire financial straits. Central to these claims is the accusation that SpinVox is misleading the market about the extent to which it uses human beings to transcribe users’ voicemails.

The firm has never denied using people when its technology is flummoxed by factors like pace of speech or accent, but the allegations suggest that the firm uses people more often than not and that its voice-to-text conversion software is weak. If it does rely heavily on people, the detractors say, the business cannot scale because of the rocketing wage bill. This, they say, explains the financial problems the firm faces. SpinVox doesn’t deny that times are tight – it has asked staff to take stock in lieu of wages and delayed payments to suppliers – but argues this is just down to the current economic climate.

So was I – along with the other hacks – about to find out the truth about SpinVox’s technology? Was I about to see the firm prove that its finances are sound?

Marlow is a comfortable, well-off, breezily confident sort of town full of comfortable, well-off, breezily confident people. At SpinVox HQ, though, the building positively bristled with anxiety. It was probably the most uncomfortable press briefing I’ve ever been to – tucked, as I was, into a room with two other journalists and 11 SpinVox personnel, around half of whom worked in some sort of PR capacity. There were more than a few worried expressions, not least the one worn by Rob Wheatley, SpinVox’s CIO, who delivered the presentation.

Filming, photography and recording were banned, which didn’t bode well.

We were promised a demo of the SpinVox system and the opportunity to ask questions, which is exactly what we got. Unfortunately I’m not sure we got the answers we were looking for.

There’s no doubt SpinVox has technology that can do impressive things with voice messages in a mobile environment, and the company has been open that its business combines a variety of human and software agents to deliver its end results. But the big question is whether this business model is scalable and cost effective at the same time.

Wheatley showed us a network diagram of the SpinVox brain, or D2 as it is known, and explained how each bit works in some detail. Apparently this kind of information is the company’s “secret sauce” and is the reason we weren’t allowed to film. I’ve no idea how much any of this info might help competitors so I won’t go into any great detail, but it seemed a sensible, even obvious, approach to the problem of converting voicemails to text messages. In what way it could be commercially sensitive I’m just not sure.

What Wheatley seemed most proud of and, indeed, what most of the company’s granted patents cover, is how the automated system fails over in real time to a human operator.

Undoubtedly, the combination of humans and machines will deliver better results than a fully automated system, and in Wheatley’s own words this is the SpinVox “balancing act” between more accurate but more expensive humans, and cheaper, but less accurate machines.
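To make that balancing act concrete, here is a minimal Python sketch of threshold-based failover. It is not SpinVox’s actual design, which wasn’t disclosed in any detail I can repeat; the confidence score, the threshold and the per-message costs are all hypothetical, picked purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# All figures below are hypothetical, chosen only to illustrate the trade-off.
MACHINE_COST = 0.01  # assumed cost of an automated pass, per message
HUMAN_COST = 0.25    # assumed extra cost of a human transcription, per message


@dataclass
class Message:
    audio_id: str
    confidence: float  # assumed recogniser confidence, from 0.0 to 1.0


def route(message: Message, threshold: float) -> str:
    """Keep the message automated if the recogniser is confident enough,
    otherwise fail over to a human agent."""
    return "machine" if message.confidence >= threshold else "human"


def total_cost(messages: Sequence[Message], threshold: float) -> float:
    """Every message pays for the automated pass; failed-over ones
    pay for the human as well."""
    cost = 0.0
    for message in messages:
        cost += MACHINE_COST
        if route(message, threshold) == "human":
            cost += HUMAN_COST
    return cost


if __name__ == "__main__":
    batch = [Message("a", 0.95), Message("b", 0.60), Message("c", 0.30)]
    for threshold in (0.5, 0.9):
        humans = sum(route(m, threshold) == "human" for m in batch)
        print(f"threshold={threshold}: {humans} of {len(batch)} to humans, "
              f"cost={total_cost(batch, threshold):.2f}")
```

Raising the threshold sends more messages to people, which should improve accuracy but pushes the bill up; lowering it does the reverse. Where SpinVox sets that dial, and how often messages fall below it, is exactly what we weren’t told.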

SpinVox has clearly spent a lot of time and money on development. The brains behind the underlying technology are Tony Robinson and Philip Woodland of the Cambridge University Machine Intelligence Laboratory, and Tenzing, the software used by the human agents, has some pretty fancy speech prediction technology involved – the training course to use it lasts two weeks. But again, speech recognition is a complicated and costly enterprise, and it seems that, to all intents and purposes, SpinVox is still heavily reliant on human interaction.

‘The brain’ is always learning, and in the case of SpinVox, it’s learning dynamically. Whereas the traditional approach to speech recognition is to update the brain with new vocabulary and semantics offline, and periodically release a new version of the platform, SpinVox does this on the fly.

Wheatley would not talk about the number of humans involved in the business, and this is the fundamental problem that the firm has. If it cannot use hard facts about its number of employees to answer accusations that its reliance on human beings is sufficient to threaten the business model, then it will never be able to dispel concerns over its feasibility. This information is “commercially sensitive”, he said, perhaps meaning that it is fundamental to the firm’s commercial viability.

But he said that costs differ dramatically from product to product, carrier to carrier and country to country, and even these vary over time. For example, when SpinVox launches a new language version of its service it buys a dictionary and loads that into the brain. Then the company sets about expanding and upgrading the brain’s language abilities by hiring people to leave messages and read text, giving out free accounts (like mine), and placing ads in newspapers inviting people to call in and leave messages (presumably in exchange for something).

Then there are people whose job it is to maintain the dictionary, as well as the quality control operators, who have to be able to speak the language they are transcribing at a native level. This suggests that SpinVox has to bring a new call centre online every time it launches into a new country, which sounds like a significant cost, especially when Wheatley boasted that traffic at the company has increased tenfold every year for the last three years.
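A back-of-the-envelope sketch shows why that growth figure matters. Only the tenfold annual growth over three years comes from Wheatley; the starting volume and the share of messages needing a human are invented for illustration, since SpinVox would not disclose either.

```python
ANNUAL_GROWTH = 10  # from Wheatley's claim: traffic up tenfold each year
YEARS = 3           # ... for the last three years


def projected_volume(start_volume: int, years: int = YEARS) -> int:
    """Message volume after compounding the claimed annual growth."""
    return start_volume * ANNUAL_GROWTH ** years


def human_messages(volume: int, human_share: float) -> float:
    """Messages needing a human if the human share stays fixed."""
    return volume * human_share


if __name__ == "__main__":
    start = 1_000  # hypothetical starting volume, messages per day
    end = projected_volume(start)
    print(f"volume: {start} -> {end} ({end // start}x)")
    for share in (0.1, 0.5):  # hypothetical human shares
        print(f"human share {share:.0%}: "
              f"{human_messages(start, share):.0f} -> "
              f"{human_messages(end, share):.0f} messages for humans")
```

Whatever the real share is, if it holds steady then the human workload grows a thousandfold along with the traffic. The business only scales if that share falls as the system learns.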

The company has admitted to having five call centres on its books, but also says, somewhat unbelievably, that it doesn’t know how many other call centres these operations subcontract out to. If the firm’s not lying, then it is admitting, at the very least, to some worryingly slack business practices.

At this point founder Christina Domecq “just dropped by”, an appearance that seemed staged at the time, and even more so when we found out that the same coincidental appearance had been granted to the previous group of journalists. What was most interesting to see was the effect her presence had on the rest of the SpinVox crew. She must run a tight ship, as they looked little short of terrified. It was tense.

Domecq answered some financial questions. The company will be cash flow positive in the fourth quarter, having just received another £15m cash injection from existing investors to see it through. Yes, the credit crunch has forced the company to extend its credit terms with suppliers and SpinVox is also in legal disputes with several ‘old’ suppliers over non-payments due to “quality of service issues”. The share offer scheme under which employees substitute some or all of their paychecks for shares is to “help SpinVox manage its cashflow” and will continue. And SpinVox is most definitely not looking for a buyer. At least not yet: “Getting a company to cashflow positive is a big tickbox for an entrepreneur,” she said.

So what did we actually learn? SpinVox has some very clever technology, and claims its workflow patent is “very powerful” but is not a source of revenue. Apparently the firm will not license its technology, and most of its core innovations will never be patented because the firm doesn’t want them to be made public, which seems to defeat the object somewhat. The firm also claims to have been approached to build other custom products using its speech technology, which may be a source of revenue.

The technology seems to work. I don’t think the demo was something out of the Wizard of Oz; there was no man behind the curtain. Rather there was a female operator and she was in full view. And whether or not the company is on the rocks financially is probably something that won’t be revealed until the last minute. But what SpinVox is claiming is that it can maintain that balancing act between expensive humans and cheap machines and keep it scalable.

The company isn’t lying when it says 100 per cent of messages are automated to some degree – a message going into the system counts as automation, even if the system then hands that message over to a human agent. However, after the visit to Marlow, I’ve come away feeling that SpinVox is much more reliant on humans than it would have you believe, and if that is the case, I have doubts as to whether it has a scalable and cost effective business model.



  1. Huh 10/08/2009 @ 4:03 pm

    “‘The brain’ is always learning, and in the case of SpinVox, it’s learning dynamically. Whereas the traditional approach to speech recognition is to update the brain with new vocabulary and semantics offline, and periodically release a new version of the platform, SpinVox does this on the fly.”

    What evidence did you see that suggests this is dynamic or in any way real-time?

    • James Middleton 10/08/2009 @ 4:28 pm

      Hi Huh,
      I didn’t see any evidence, that’s just how Wheatley compared SpinVox to other voice to text systems.
      The demo took place on what was effectively a test network, and it’s possible that it was an elaborate set-up. I don’t think it was, however.
      SpinVox claims it’s constantly feeding new words into the live system, and its sentence prediction lattice and recognition abilities continue to improve with the more data it’s fed.
      Again, this is all stuff SpinVox claims – so I guess the company’s survival depends upon whether it’s telling the truth about its capabilities.

  2. iPhone App Developer 10/08/2009 @ 4:21 pm

    The first company to crack voice recognition is likely to dominate the tech landscape for a decade or two; Google, Apple and MS are all in the running.
    As much as I’d like to think a relatively small UK company could win the race I’m not sure it’s a realistic expectation.

  3. dadge 20/08/2009 @ 11:30 am

    We have been following this story with interest over the last few weeks.

    One of the other important comments which has been made about SpinVox is that a lot of people who use the service are under the impression it is fully automated. The fact that humans are involved raises data protection questions.

    Did you have the opportunity to raise this point with them?

    • James Middleton 20/08/2009 @ 3:24 pm

      This is a tricky one, Dadge. The subject did come up but SpinVox has a way of dodging the issue – it claims that its data protection entry says that it does not send data out of Europe as a data *controller*, so stuff like the user’s name and number is not sent out. However, data *processing* is not covered in the entry and isn’t required to be. This means SpinVox can send anonymised data [the content of the voicemail] outside of Europe to its call centres, which, in turn, may send it on to subcontractors. I’m no expert but this strikes me as a grey area. It could be that the content of a voicemail could accurately identify an individual, for example.

  4. Keith Kreuz 20/08/2009 @ 3:52 pm

    Thanks for the information, James … as I read your article, I kept thinking that you would compare and contrast SpinVox with Jott (which seems to be the same business model). Have you investigated Jott and, if so, do you think that Jott has a “scalable and cost effective business model”?

  5. James Middleton 20/08/2009 @ 4:23 pm

    Keith, Jott does indeed appear to have a similar business model. It’s also public knowledge that Jott uses a combination of humans and machines to do its transcription, and it’s also worth noting that Jott recently said it had significantly decreased its reliance on humans and was using more machines. Perhaps this is because of cost issues? I don’t know. But if you scan the Jott forums, around the same time people started complaining about accuracy issues with the service. This might be coincidence.
    Whilst I think SpinVox and Jott have cool technology, and do a great job – I’m a satisfied user of the former – the question remains whether it can be done as well using machines alone [or as close as possible]. I think there are still great technological leaps that need to be made in this area for this to happen.
    In the meantime, the more people that use the service, the more expensive it will likely be for the providers, which is why I’ll be watching SpinVox’s Latin American rollout with interest.
