Crushed it! Landing a data science job

Data science interviews are the worst because data science is interdisciplinary: code for “you have to know everything about all the disciplines.”  Depending on the company and the team, your interview might look like a software developer’s interview, or it might look like a statistician’s interview, and the bad news is that virtually none of the material overlaps.  I recently spent a ton of time studying for interviews and I’ve got some hot tips to pass along if you’re thinking about a move soon.


After two amazing years with the Nordstrom Data Lab, I’ve accepted a research scientist position at Amazon Web Services to work on S3.  I’m excited to begin a new chapter of my career, and relieved that the interview process is over because it’s grueling and time-consuming.  Interviews typically consist of one to three screener conversations and then an all-day on-site, and they’re stressful because it’s hard to know what you’ll be asked and often you’re expected to perform feats of intellect that you don’t typically do as a data scientist (at least not devoid of context, from memory and over the phone).

You need time

The best piece of advice I can offer is that if you’re thinking of moving on (or moving into the field) start preparing now.  You want to give yourself a lot of time and not be in cram mode.  Take the time to be sure that you can explain core concepts in your own words.  Screening questions are commonly phrased like this:  “how would you explain to an engineer how to interpret a p-value?”  Explain it to an engineer, someone who, presumably, isn’t a statistician and might not be used to that language.  You don’t want it to be the first time you’ve had to rephrase basic definitions like that.  Also, don’t underestimate what nerves can do to your ability to recall information, even stuff you really thought you understood.  If you’re new to the field, you might need to give yourself more time to prepare if a lot of the concepts are unfamiliar to you.
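
A small simulation can make the plain-language version of a p-value concrete. The sketch below is illustrative only: the function name `permutation_p_value` and the sample numbers are made up for this example, and a permutation test is just one convenient way to show the idea. It shuffles the group labels many times and counts how often chance alone produces a gap in means at least as large as the one observed. That fraction is the p-value.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration: it answers "if the group labels
    were meaningless, how often would a gap this big show up by chance?"
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real link between label and value.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


if __name__ == "__main__":
    # Made-up page load times (seconds) for two hypothetical server configs.
    old_config = [10, 11, 12, 13, 14]
    new_config = [1, 2, 3, 4, 5]
    print(permutation_p_value(old_config, new_config))
```

The engineer-friendly reading: a small result means random relabeling rarely reproduces the gap you saw. It does not mean the effect is large, and it is not the probability that the hypothesis is true.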

I also highly recommend spending time on the preparation of your professional materials, i.e. résumé and cover letter.  There seem to be two camps on this: those who think it matters and those who don’t.  Do interviewers look at that stuff in any detail?  Hard to say generally, but I did a ton of interviews at Nordstrom and I can tell you that I was very critical of résumés and letters.  Typos are unacceptable, a letter where the applicant brags is a red flag, weak materials indicate lack of interest (or lack of respect for the reader), and keyword stuffing was an open invitation for me to ask about where and why the applicant applied the methods.  In the broader technology industry I think people tend to believe the myth that all anyone cares about is what you’ve got up on GitHub, but most companies, big companies, don’t look at your GitHub, they look at your résumé and cover letter (this might also come as a shock, but technology isn’t a meritocracy).  Ultimately these documents are how you’ve chosen to represent yourself professionally, so they should matter to you even if you think they don’t matter to your interviewers.

If you didn’t try it, you probably don’t know it

I recommend doing a lot of practice problems and being very analytical about your weak areas.  Many falsely believe that reading and rereading is an effective study strategy, but it isn’t effective when you’re required to solve probability zingers and logic puzzles live (highly recommend Make it Stick, maybe before you study).  By mindfully doing practice problems you’ll know immediately where you’re weak and that will help you prioritize how you study.  Wasting time on stuff you know pretty well is a procrastination strategy, and I thought you were busy?  Also, this is a technical field and you should be prepared to answer questions at a substantially technical level.  If possible, I recommend doing practice problems standing at a whiteboard just to make sure you’re comfortable writing that way and speaking while you write.  You can find lots of tips and interview questions on Quora.

Office setup for my first round of interviews while still a PhD student at the University of Michigan. I was very green, transitioning out of my field, and terrified of not knowing something. This level of obsession is not healthy nor recommended.

Learn as much as you can about the role ahead of time

Did you know that an informational interview is a thing?  Until a friend used this strategy, I didn’t either!  Sometimes you’ll find that the interview process got going, but you’re not even sure you want the job.  You can tell them to slow their roll and just do an informational interview where you can learn more about whether pursuing the job is something you want.  Also take the time to “stalk” the company and people you’ll be interviewing with.  For example for my AWS onsite I looked up everyone I’d be interviewing with and spent some time on LinkedIn understanding their background.  This can sometimes help you guess what types of things they’re liable to ask you.  Oh, she’s an engineer so she probably won’t ask about stats, but she’ll want to hear about scaling methods up.  Wait, but she’s a principal engineer, so maybe she’ll actually want to hear more about my leadership and inter-personal abilities.  Ellen Chisa’s got a lot of great tips on what not to do in an interview as well.

On to the resources!

You can reasonably expect to be asked about the following topics:  Statistics, Machine Learning, Forecasting, Algorithms, everything an undergrad CS major should know, and then the scalability and performance associated with all those things.  Oh also, you should be prepared to program, typically in a language of your choice.  Easy peasy right?!

Books

It probably doesn’t matter which one, but get an intro probability book.  I used my trusty old Ross, a standard undergrad text in probability.  If you have Ross, I recommend doing the self-tests in chapters 1 – 5 and using those tests to help you decide where to spend more time.  Combinatorics and basic probability questions are the norm for phone screens so make sure you’re comfortable doing them.  I also used Casella and Berger, basically the Bible for statisticians, to review the properties of expectations and variance.  Generally I’d say that text is probably more advanced than is required in most interviews.
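
To give a flavor of the combinatorics and basic probability questions that show up in phone screens, here is a minimal sketch for checking hand-worked answers. The two questions (the sum of two dice, and the number of heads in a run of fair coin flips) and the function names are hypothetical examples chosen for illustration, not problems taken from Ross or from any particular interview.

```python
from fractions import Fraction
from itertools import product
from math import comb


def prob_two_dice_sum(target, sides=6):
    """Probability that two fair dice sum to `target`, by brute-force enumeration."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    favorable = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(favorable, len(outcomes))


def prob_exactly_k_heads(n, k):
    """Probability of exactly k heads in n fair coin flips: C(n, k) / 2^n."""
    return Fraction(comb(n, k), 2 ** n)


if __name__ == "__main__":
    print(prob_two_dice_sum(7))
    print(prob_exactly_k_heads(10, 5))
```

Working the answer by hand first and only then running the check keeps this in the spirit of testing yourself rather than rereading.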

For the CS related topics I primarily used Programming Interviews Exposed, Cracking the Coding Interview, and Programming Pearls.  Exposed is definitely the most comprehensive of the books and if you only have time to look over one, go with that one.  Cracking is very succinct and specific to the interview processes at the big boy companies like Amazon, Google and Facebook but isn’t super generalizable.  The version I was using also had some really irritating vignettes about making sure you’re “a guy they want to get beer with” that were so bro-y I quit using the book (I expected more from Gayle).  Pearls is not an interview book at all.  It’s a collection of problems in computing and mental narratives of approaches to solving them.  This book isn’t really for studying as much as it is for reasoning about computing and it’s a great read if you’ve got time.

Coursera
Some classes allow you to view archived sections so you don't have to wait to see the lecture materials.

Coursera is literally the shit.  If you got rid of your old textbooks or don’t want to buy anything, you can easily get by with the material on Coursera.  I hiiiiighly recommend the Biostatistics bootcamps from Johns Hopkins.  They are an excellent review of the first year of a graduate level statistics program.  Don’t spend too much of your time watching the lectures.  Instead test yourself with the quizzes and assignments and watch the videos in areas where you are weak.  Also check out the data science specialization which is offered by the same folks and covers applied skills like exploratory data analysis and programming in R.  Andrew Ng’s machine learning course is a must and is quite enjoyable.  He does a great job of motivating methods and spends a lot of time building intuition which is very valuable for phone screens where you might not jump into technical details but still need to demonstrate familiarity.  The cloud computing specialization was also great for me since I was gunning for the job at AWS.  I’m transitioning industries again from retail technology to cloud computing and I wanted to get a better sense for the types of problems that I’d be expected to discuss.  In this case I just watched videos so I could absorb the language people use to describe the field rather than focus too much on the technical details.  I’m always on the prowl for great classes on Coursera, so if you have recommendations leave them in the comments!

Coursera used to make me crazy because they enforce this antiquated notion of start and end times.  I recently discovered that many courses allow you to view archived lecture materials so you can learn the material without having to wait for the class to start.  This was a game changer for me, so check it out.

Good luck!

That’s about all I’ve got in terms of the tangibles. But I’ll leave you with a couple platitudes.  First, stay calm!  You won’t be able to recall your knowledge when you’re all keyed up.  This is something I have trouble with, which is why I do crazy things like write down everything I know and tape it to the wall, but that’s not recommended behavior.  My new crazy strategy is to do a bunch of jumping jacks a couple minutes before a screener call so that I’m sweaty and out of breath.  Also, if you’re local, ask to do screeners in person.  I give great face, and I’ve found that I do a lot better when I can see the interviewer than I do over the phone.

Don’t forget you’re interviewing them too and trust your initial impressions.  I had an informational interview with a start-up and left with the feeling that the interviewer was arrogant and not really listening to what I was saying, but I thought the work seemed interesting.  I did a follow-up and all my reservations were confirmed a million times over.  It was a terrible experience and a total waste of my time that could have been avoided if I’d trusted my gut feeling that these people were douches. Interesting work isn’t worth spending a minimum of 8 hours a day with people who won’t respect you.

Finally, try not to compare your experience to those of others, because you might have it wrong and it might just bum you out.  I happened to be interviewing at about the same time as a number of colleagues whom I know well.  I was pretty shocked and, at the time, angry about my experience compared with some of theirs.  Without going into specifics, I interviewed the same week for the same job in the same office as a male colleague with less experience.  He got to do his screener in person with someone from the team he’d be on and was asked very rudimentary questions about dice roll probabilities.  I had to do my screener on the phone with someone from a different office and was asked to find the optimal strategy of a game theory problem. It’s hard to hear that and not read into it, and it’s harder not to be angry.  Now I attribute that inconsistent interview experience to poor recruiting practices and company-wide immaturity. I don’t want to work somewhere that doesn’t know how to interview for my role and as a result probably hires people I don’t want to work with.

In the end you should prepare as much as you can, but don’t fret if you feel like there are holes in your knowledge.  Trust yourself, trust your impressions, and learn from those bad interviews so that you can crush it in the next one.

 

30 Comments

  • This is an EXCELLENT article and a must-read for anyone interviewing for a data science position. I’m starting to interview again and your crack about “knowing everything about everything” is dead-on. I could tell you stories… 🙂

    I do have recommendations for Coursera classes for you and your readers:

    • Mining Massive Datasets: Too many new data scientists figure that because they know regression, SVMs, neural nets, and so forth that now they can work on Big Data problems. Not true. Take this course to find out how to deal with these uniquely difficult problems.
    (https://www.coursera.org/course/mmds)

    • Probabilistic Graphical Models: Koller is a well-known expert in the field and this class is not for the timid. But if you want to get up to speed with a powerful, yet still-underappreciated method, this is the one to take. My only complaint is I wish she would have included material on causal modeling using DAGs because questions about causality in Big Data sets will become more crucial as the field moves from predictive analytics to prescriptive.
    (https://www.coursera.org/course/pgm)

    • Core Concepts in Data Analysis: Another advanced course. Remember how you learned how to use the tools of Calculus as a Freshman before taking Real Analysis as a Junior and deeply understanding the fundamentals? That’s what this course is. Take Ng’s course first and then take this one to truly understand what you were actually doing.
    (https://www.coursera.org/course/datan)

    Thanks again for the great article! I’m going to share this with my networks!
    -Mark

  • Congratulations on your new gig at Amazon, Erin! Thanks for sharing your experience! I had phone interviews with Amazon a couple years back when I was making a move to Seattle. But, the interview process took too long. I was waiting in line for interviews with several managers from different departments. So, I moved on.

    I agree with your assessment on Gayle’s book 🙂 The interview book is mainly for fresh graduates. I bought the book for the well compiled interview questions. Here is the site if you don’t want to buy the book – http://www.careercup.com/.

    I took Data Analysis taught by Jeff Leek from Johns Hopkins. It is a great class. This was before they took Jeff’s class apart and spun off the Data Science Specialization. Jeff’s class is great if you have a good statistics background. I think Udacity classes are under-represented. They have some really great materials. I would recommend Data Analysis with R – https://www.udacity.com/course/ud651. This mainly focuses on data exploration using visualization in R, which is the first part of Jeff Leek’s lesson. But it is more in-depth, such as using the plyr library.

    For Statistics 101, I recommend Udacity’s Stats 101 – https://www.udacity.com/course/st101, for those who would like to learn more about Stats. The part I like is Sebastian Thrun’s way of explaining MLE and many more concepts. After Stats 101 and you feel Stats is AWESOME and would like to learn more, I recommend Statistics in Medicine – https://lagunita.stanford.edu/courses/Medicine/MedStats/Summer2014/about. Dr. Sainani is awesome! The best Stats course I have ever taken (https://www.youtube.com/watch?v=ySPm3boeO4c). It has a good mix of application and theories, which I love so very much. For example, it talks about the pitfall of p-value. As the data gets larger, p-value does not mean much and the confidence interval comes into play. In my undergraduate, my stats class focused on studying the formula and proof and some applications of it, but not much. This class makes up for the application parts.

    I also recommend PGM, even though I haven’t taken it. It conflicted with my Neural-Network class on Coursera. By the time I wanted to take it, I was accepted to the Georgia Tech Master’s Program. So, it stays on my to-do list.

    One thing about interview I dislike is the questions they asked sometime don’t have anything to do with your day job. Like the example you gave about Game Theory Optimal Strategy. It took me a few minutes to recall an example of Game Theory for multiplayer, non-repeatable game and then explains how to calculate the optimal strategy using min-max strategy. Big words, I know! The reason I still remember because I just finished Georgia Tech Machine Learning class (highly recommend for those who would like to go one step further after taking Andrew Ng’s course). Game Theory is used heavily in Reinforcement Learning. I am curious to find out the usage of RL in the industry.

    One thing for the interview at company like amazon is they don’t care about the right answers but how you approach and solve a particular problem. So, if I used Knowledge-based AI: Human Cognition approach to solve a production problem, I may get high score since I think outside of the box and leverage knowledge from different disciplines.

    Sorry for the long comment. But, I got excited when talking about Machine Learning, Applied Math in Machine Learning or Algorithm in general 😉 Best of luck to your new gig!

  • I left out the Stats link for Dr. Sainani’s Stats lesson. Here it is – https://www.youtube.com/user/StatsSpring2013/videos

    Georgia Tech’s Machine Learning course – https://www.udacity.com/course/ud262

    Enjoy learning!

  • Great post. After reading it, I think I have something to go by rather than throwing stones in the dark. However, I was wondering if it is at all possible for you to post pdf copies of some of the notes you used here for people who do not necessarily come with a stat/compsci background. I guess it is too much to ask or downright stupid but it could be a lot of help to would-be data scientists. Thanks

    • Unfortunately my crazy notes are long gone. :<

      • I was going to post the same question about the cool crazy notes 🙂 I know that I need to do the same, some sort of cheat sheets to have handy, flashcards etc.. I will do some research and post here, if I find any guides. So far I found out about this one for machine learning (its more of a booklet rather than a sheet) https://github.com/soulmachine/machine-learning-cheat-sheet Anyone out there that have anything for CS and CS algorithm questions, stats, sql?, map reduce (word count and matrix transpose are popular questions) less often I also see Graph Theory, Random Graph, Graph Mining – I will check and post back again

  • We talked to some data scientists and took a stab at writing 20 data scientist interview questions that evaluate candidates in three different categories. You can read and study them here: http://www.sas.com/en_us/insights/articles/analytics/data-scientist-interview-questions.html

  • Good points to help someone like me. My question is do you have a PhD or what’s your highest qualification? And is PhD a required or preferred for a data scientist job?

  • This is *beyond* helpful – thank you so much for this. I’m coming from a similar-ish background (or I guess a lot of people transition halfway out of academia/grad school tracks?!), and this is amazing. Thanks thanks thanks.

  • Umm…yup.

  • Wow! Thanks so much for sharing your insight and experience.

  • Awesome article Erin.. Inspiring and Admiring!!

  • Crisp and sharp. Thanks a lot!

  • Thanks for the post.
    Recently, I found this course on Coursera by Andrew Ng, https://www.coursera.org/learn/neural-networks-deep-learning/

  • Thank you for writing a very helpful post. One more resource I would like to add to crack the coding interview is – https://www.interviewbit.com/google-interview-questions/

  • Hi Erin, Thank you so much for sharing your journey of going from student/academic to data scientist. If you still remember, can you tell me what books you had on hand for interviews (as pictured on your desk). I find the usual suspects such as Casella and Berger are too unwieldy for quick reference.

  • Hi Erin –
    Loved this post and just wanted to thank you for sharing these insights. Another great resource that I have found for coding interview questions is Byte by Byte. I love the free e-book on dynamic programming and the fact that most resources on there are free.

    Great post for recursion in particular: https://www.byte-by-byte.com/recursion/

