Data Science Is Not Taught At Universities - And Here Is Why

Have you ever wondered why it takes so painfully long to deliver projects like predictive models or segmentations?

Why is there such a scarcity of data science skills on the market?

Why is it that when you hire a freshly baked Ph.D., he or she fails to deliver anything useful for the first year of work (and then quits for another job)?

In my view the answer is rather simple - a massive skill gap amongst university scientists. Despite course names like ‘Business Analytics’ or ‘Data Science’, I would venture an opinion that the vast majority of the scientists leading them have no idea what ‘Data Science’ in the ‘Business world’ really looks like. They are not even close. And what’s worse – to the students’ harm – they are perfectly happy with it.

I recognise it because I used to be a university scientist myself and, much more importantly, I have been regularly interviewing and employing graduates for the last 8 years. As such, I have had a first-hand view of the gap between the teaching and the practice, as well as of the gap between grads’ expectations of the job and the much harsher reality.

Business Understanding – Find Your Target

University learning leads you to believe that phase 1 of CRISP-DM is quite simply about gathering the context and deciding on a modelling objective. Let's leave business politics to the local managers and let the data do the talking!

That’s not quite true. Someone somewhere has already decided on the modelling objective, and as for the context, there is a long list of practical issues to clarify. I will pick the most important one, which is also very much ignored by University science: Target Definition.

I can see the University folks' puzzled looks... Yes, they agree that having a clear historical indication of who was 'good' and who was 'bad' is critical for the model's predictive power. It's just that all the tables they use in machine learning research already have the target information clearly defined. Here comes the famous IRIS dataset, then the Wisconsin Breast Cancer, there is even Credit Risk or Telco Churn data, and they all have the Target column there, so what the hell am I talking about?

The problem is that in real life the Target flag is NEVER there.

For churn modelling you may have many churn types on the system and need to distil the few that need modelling. And hey - when a subscriber moves from a Postpaid contract to Prepaid – is this churn or not? (‘Yes’, says the Postpaid Base Manager; ‘No’, says the CEO.) You have to make the call and someone in the Business might not like it – so much for neutralité politique. Once you have your Churn Type and Churn Date, you also need to find the Non-Churn Dates. Yes, I can see the head-scratching.
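To make the churn example concrete, here is a minimal sketch of one way a Target flag could be derived. Everything in it is an assumption made for illustration: the event type names, the tiny subscriber base, the single observation date and the 90-day outcome window. Giving every eligible subscriber the same observation date is one simple way of answering the Non-Churn Date question, not the only one.

```python
from datetime import date, timedelta

# Hypothetical subscriber base: subscriber_id -> activation date.
SUBSCRIBERS = {
    "A": date(2022, 1, 1),
    "B": date(2022, 5, 1),
    "C": date(2022, 7, 1),
    "D": date(2021, 1, 1),
    "E": date(2023, 2, 1),
}

# Hypothetical event log: (subscriber_id, event_date, event_type).
EVENTS = [
    ("A", date(2023, 3, 10), "DISCONNECT_VOLUNTARY"),
    ("B", date(2023, 2, 20), "MIGRATE_POSTPAID_TO_PREPAID"),
    ("C", date(2023, 6, 1), "DISCONNECT_VOLUNTARY"),
    ("D", date(2023, 1, 15), "DISCONNECT_VOLUNTARY"),
]


def build_targets(subscribers, events, observation_date, window_days, churn_types):
    """Return {subscriber_id: 0 or 1} for subscribers active on observation_date."""
    window_end = observation_date + timedelta(days=window_days)

    # Earliest event per subscriber that counts as churn under this definition.
    first_churn = {}
    for sub_id, event_date, event_type in events:
        if event_type in churn_types:
            if sub_id not in first_churn or event_date < first_churn[sub_id]:
                first_churn[sub_id] = event_date

    targets = {}
    for sub_id, activated_on in subscribers.items():
        churn_date = first_churn.get(sub_id)
        if activated_on > observation_date:
            continue  # not yet a customer on the observation date
        if churn_date is not None and churn_date <= observation_date:
            continue  # already gone, so nothing left to predict
        # Target = 1 only if the churn falls inside the outcome window.
        targets[sub_id] = int(churn_date is not None and churn_date <= window_end)
    return targets


OBSERVATION_DATE = date(2023, 1, 31)
STRICT = {"DISCONNECT_VOLUNTARY"}                 # the CEO's view
BROAD = STRICT | {"MIGRATE_POSTPAID_TO_PREPAID"}  # the Postpaid Base Manager's view

strict_targets = build_targets(SUBSCRIBERS, EVENTS, OBSERVATION_DATE, 90, STRICT)
broad_targets = build_targets(SUBSCRIBERS, EVENTS, OBSERVATION_DATE, 90, BROAD)
print(strict_targets)
print(broad_targets)
```

The only difference between the two runs is the set of event types that count as churn, and that alone flips subscriber B from a non-churner to a churner.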

To take another down-to-earth, every-frickin-time example: for Cross Sell Modelling you may reach for the call centre logs, but there is no sale/non-sale flag. It turns out there are 30+ different answers given by customers, and it’s clear as mud which ones are related to success or failure.

And what if you are asked to build a cross-sell model for an e-gaming company but there has been no historical campaign at all to learn from? (Hint - you may reach for the existing cross-product holding data.)

What if you are asked to build a Fraud Detection Model but there is not a single fraud properly recorded? (Hint – Predictive Modelling may not be your best shot; maybe you should look at Anomaly Detection techniques.)
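As a rough illustration of the Anomaly Detection hint, the sketch below flags unusual values without any fraud labels, using the modified z-score based on the median and the median absolute deviation (MAD). The amounts are made up, the function name is hypothetical, and the 3.5 cut-off is a commonly quoted rule of thumb rather than anything specific to fraud.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so this score is undefined
    return [
        i
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical daily transaction amounts for one account.
amounts = [20, 22, 19, 21, 23, 20, 250, 18]
suspicious = mad_outliers(amounts)
print(suspicious)
```

A flagged value is only a candidate for investigation, not a confirmed fraud, and real projects would typically look at many variables at once.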

Even banks with a clear loan default definition enforced by regulation realise that it’s not fit for modelling purposes, so they create their own targets, different from the regulatory ones, to build better models.

If you still think it’s an easy thing to sort out – let me tell you that in one of the telecoms we spent half a year figuring out which customers had actually churned in the past and when. That was longer than the follow-up modelling project.

 

Data Understanding & Preparation – How Good Is Your SQL Kung-Fu?

Despite what you may have practiced in the lab - these two phases are NOT about exploratory analysis, drawing the distributions, dealing with outliers or missing values in your Modelling Table. Why?

Because you don’t have the Modelling Table yet.

Maybe in the lab you were handed a nice Modelling Table to load into R or SAS Enterprise Miner, but in real life that almost never happens.

Your source will be a database with tens or hundreds of tables, millions of records, usually after 3 painful migrations with gaps in history, columns without descriptions and with no one around to answer your questions. That’s right, we're not in Kansas anymore.

The task has its own name - Feature Engineering - and it’s a hellishly laborious, manual and painful process. Despite the enormity of the work that goes into Feature Engineering, I still haven't found even one proper book about it.

If you think there is a team of SQL developers waiting to build that table for you – you are delusional. It's all on you. We code SQL, test SQL and debug SQL for 70%-90% of project time. Sometimes it’s SAS and, more recently, hacking around with Java or Python.
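For a flavour of what that SQL looks like, here is a toy sketch that rolls a transaction table up to one row per customer as of a cutoff date. The schema, the data and the three features are all hypothetical, and an in-memory SQLite database stands in for the real warehouse. A real project repeats this pattern across dozens of tables and hundreds of features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, joined_on TEXT);
CREATE TABLE transactions (customer_id INTEGER, tx_date TEXT, amount REAL);
INSERT INTO customers VALUES
  (1, '2022-01-05'), (2, '2022-06-10'), (3, '2022-11-01');
INSERT INTO transactions VALUES
  (1, '2022-12-20', 30.0),
  (1, '2023-01-10', 50.0),
  (1, '2023-02-15', 99.0),  -- after the cutoff, must not reach the features
  (2, '2022-10-01', 20.0);
""")

# One row per customer. The LEFT JOIN keeps customers without transactions,
# and the date filter sits in the ON clause so it cannot drop those customers.
FEATURE_SQL = """
SELECT c.customer_id,
       COUNT(t.tx_date)             AS tx_count,
       COALESCE(SUM(t.amount), 0.0) AS tx_total,
       MAX(t.tx_date)               AS last_tx_date
FROM customers AS c
LEFT JOIN transactions AS t
       ON t.customer_id = c.customer_id
      AND t.tx_date <= :cutoff
GROUP BY c.customer_id
ORDER BY c.customer_id
"""


def build_modelling_table(connection, cutoff):
    """Return one feature row per customer using data up to the cutoff date."""
    return connection.execute(FEATURE_SQL, {"cutoff": cutoff}).fetchall()


modelling_table = build_modelling_table(conn, "2023-01-31")
for row in modelling_table:
    print(row)
```

Dates are stored as ISO-8601 text here, so plain string comparison orders them correctly. Moving the cutoff filter into a WHERE clause would silently drop customer 3, which is exactly the kind of trap that eats project time.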

If you haven’t done Feature Engineering from messy relational data sources – it means you have no idea what to do for this most important part of the Predictive Modelling project.

Oh yes – you heard me correctly: Feature Engineering has far more impact on predictive accuracy than anything you can do in the Modelling phase.

 

Modelling – Access Denied

There is no way a graduate – even a Ph.D. – should be allowed to build a model independently. They would be flooded by leaks from the future, fail to notice that half of the dataset went mysteriously missing during model training (the SAS message about it is really tiny), over-fit the model and walk into a dozen other traps without even noticing. And you need to disarm all of them, because even one left behind may result in a completely useless model.
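Two of these traps can at least be made loud with cheap checks. The sketch below assumes you know the latest source date behind each feature and that rows are plain dictionaries; the function names, feature names and data are hypothetical, and the checks catch only the crudest forms of leakage and silent row loss.

```python
from datetime import date


def leaking_features(latest_source_dates, cutoff):
    """Return names of features built from data dated after the cutoff."""
    return sorted(
        name for name, latest in latest_source_dates.items() if latest > cutoff
    )


def complete_rows(rows, required):
    """Keep rows where every required field is present; report how many were dropped."""
    kept = [r for r in rows if all(r.get(k) is not None for k in required)]
    return kept, len(rows) - len(kept)


cutoff = date(2023, 1, 31)

# Hypothetical features with the latest date of the data each one was built from.
latest_source_dates = {
    "tx_total": date(2023, 1, 10),
    "days_to_disconnect": date(2023, 3, 10),  # peeks past the cutoff
}
leaks = leaking_features(latest_source_dates, cutoff)

# Hypothetical training rows; many tools quietly skip rows with missing inputs.
rows = [
    {"id": 1, "income": 1200, "tenure": 14},
    {"id": 2, "income": None, "tenure": 3},
    {"id": 3, "income": 900, "tenure": 27},
    {"id": 4, "income": None, "tenure": 8},
]
kept, dropped = complete_rows(rows, ["income", "tenure"])
dropped_share = dropped / len(rows)

print(leaks)
print(f"dropped {dropped} of {len(rows)} rows ({dropped_share:.0%})")
```

Printing the dropped share on every training run turns the tiny log message into something nobody can miss.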

My point is – even after a good course, it takes years of experience to become accomplished enough to deliver safe and reliable models. No one becomes a Data Scientist just by completing a Data Science course.

  

Get Real, University

Universities focus on machine learning techniques (i.e. the Modelling phase only) because this is the cool stuff. They do not want to engage in researching and teaching the much more important Data Preparation process because it looks so uncool in comparison. Even its name sucks. In a university lab everyone runs neural networks. No one writes the feature-generating SQL.

This approach massively skews students’ understanding of the future job. They learn that Advanced Analytics means playing with Artificial Intelligence in R all day long. When they find out that it takes months of repetitive coding and requires the psychological stamina of an A&E surgeon and meticulous attention to detail – they are confused, disappointed and frustrated.

If Universities truly want to teach Business Analytics instead of a ‘Machine Learning on the Cleanest Datasets Ever’ course, they need to work with students on real, messy data where nothing is obvious and one deadly trap follows another. Drag the students through the mud, following the CRISP-DM process from start to end, and then they will see for themselves whether they are cut out for this kind of work.

And those who like it will tell this story in job interviews and will get the job on the spot – because these are the skills employers are looking for.

Maciek Wasiak

CEO of Xpanse AI - Automated Data Science platform

6y

Thanks for kind words Tina S.! The article is 2.5-year-old by now and since then we developed software that automates the largest parts of Data Preparation for Predictive Analytics – you can check out www.xpanse.ai if interested :)

Tina S.

DATA ANALYSIS PROFESSIONAL ◄► MBA, BUSINESS ANALYTICS

6y

Great article. As a current student in a Business Analytics program, this article is spot on. I am fortunate to be learning the messy stuff at work! I would also say, this is mostly true for all degree programs except for maybe finance and accounting with their hard fast rules. Schools and universities can’t keep up with real world applications. It's on the employer to get their team members there. And who really invests in proper learning and development from the start? I've never experienced it.

Lars Juhl Jensen

Professor at the NNF Center for Protein Research in Copenhagen. My lab specializes in network biology and text mining.

6y

Fun read but overgeneralizes about universities. I have yet to be involved in a data-mining project at a university, where someone handed me a dataset that was in good shape and had a clear target flag attached. Every project has been like what is here described as the real world (as opposed to what you're taught at university).

Neal Dunkinson

Vice President Solutions & Professional Services at SciBite (Acquired by RELX/Elsevier in 2020)

6y

We're participating in the following approach to help tackle exactly this challenge. https://www.cambridgenetwork.co.uk/news/helping-apprentices-lead-the-field-in-big-data/

Arno Germond

Permanent research scientist (PI) in Cell Bioimaging, INRAE (France). Ex RIKEN (Japan). Founder Cellcityhub

6y

Hehe, I am very excited at the idea of joining the next eigenvector seminar organized by Barry M. Wise
