It's worth noting that the article does not discuss the reproducibility of results (e.g. with a Jupyter notebook) or the implementation of said results (e.g. deploying/validating models), both of which matter much more than any code style conventions for data-related projects.
I can't count how many times I've been shown results and, after asking how to reproduce them (and sending several follow-up notes, sometimes escalating to higher-ups), eventually received a series of command-line arguments or a barely functioning R script.
These conclusions are too important to be so sloppily produced. We need verification, validation and uncertainty quantification for any result provided to decision makers.
Learning uncertainty propagation in engineering statistics was one of those concepts that seemed to be immediately useful and have far more implications than any textbooks emphasized.
I was very happy to have that background when I took part in developing statistical models used for wind hazard analysis of nuclear power plants in my first job out of college.
To put it cynically: if other people can reproduce your results, you might no longer appear 10x more productive than they are.
Which is to say, I think starting with the idea that you're aiming for 10x is pernicious and tends to create dysfunctional teams. The claim that some developers, in some circumstances, are ten times more productive than others may or may not be true, but software development needs processes whose goal is to help an entire team rather than to raise an individual to that "level".
Yes, it's better to have a 10x team than a 10x developer.
Developers should strive to better themselves, but it's important not to fool yourself, too. Having a strong team is almost always better from a business point of view.
Indeed. I have seen several teams with a bunch of (self-styled) "10x" devs, and found that the productivity and quality of the team decrease in direct proportion to the number of "10x" devs on the team.
I shun "rockstar" and "10x" (and whatever other bullshit moniker they will come up with next) team members. Give me a group of smart people that gel well together, and are highly self-confident without egos getting into the way, and we can move mountains.
This, 100%. I have read postmortems of some "significant discoveries" which turned out to be reproducible only with a particular software build on a single analyst's machine. Or not at all. One "result" turned out to hinge on the iteration order of Python dictionaries.
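Relying on dictionary iteration order is a classic trap. A minimal sketch of the failure mode (the feature names here are hypothetical), and how making the order explicit avoids it:

```python
# In Python < 3.7, dict iteration order was an implementation detail,
# so any "result" that depended on it could silently change between
# interpreter builds or machines.
feature_weights = {"height": 1.2, "weight": 0.8, "age": 0.5}

# Fragile: the order of this list depended on the interpreter.
fragile_order = list(feature_weights)

# Reproducible: impose an explicit, documented ordering instead.
stable_order = sorted(feature_weights)
```

Since Python 3.7, dicts preserve insertion order, but an explicit sort is still clearer about intent and robust to how the dict was built.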
And this definitely didn't happen after those working on tools to help said analysts make reproducible results encouraged the analysts to use said tools... No, that would be crazy.
I completely agree. Almost all of this article appears to have little to do with being a Data Scientist in particular and more to do with some good practices for writing code in general. So the advice itself is fine, just not what I was hoping for based on the title.
Reproducibility, like you say, however, is something that is an issue far more particular to data science, and worth more serious consideration and discussion. Hand-in-hand with that is shareability. I'm a fan of what airbnb has open sourced to address some of those issues in their knowledge repo project: https://github.com/airbnb/knowledge-repo
Hey, thanks for the comment! I'm the author of the talk-turned-post :-) You are completely correct that reproducibility is super important in data science workflows. While I did mention it in the post (and in the talk the post was based on), I only mentioned it as part of the section on version control tools. That said, I think it's not something that is focused on enough (obviously I'm guilty of that too), so I plan on doing a follow-up post focused on reproducibility and the tools that can help you recreate your results. Kinda putting the "science" back in data science. Really, I want an excuse to play around with tools like https://dataversioncontrol.com/, which looks super useful; I mentioned it in the post but haven't had a chance to use it.
Nothing feels cleaner than storing everything (notebook, raw data, cleansed data, misc scripts, etc.) in a docker image when you're finished with the project. Data science and docker are meant to be besties.
I would prefer recommending a stable build process: a Docker image can be just like a VM image, or that one PC in the corner of the lab that nobody is sure is still needed. It's far better than having nothing, or just the result file, but it still leaves open the possibility of needing to reverse-engineer its internal state, and given how fast the Docker world moves I would not want to bet on format compatibility 5 years out.
Docker could be that stable build process but it requires the additional assertion that there wasn't, say, a truckload of changes made using `docker exec` or a bunch of customizations to files which were copied into the image. Simply putting a note on the source repo which says that might be enough.
Is there a good solution to that problem, though? (Serious question). I recently did a laptop refresh and am using it as an opportunity to solidify my approach to ML development, and would love to hear if there is a good solution to long-term reproducibility. I'm currently leaning towards Docker, but maybe Vagrant or another "pure" VM approach is better...
Not perfectly, but a good start is to keep all the software assets AND data assets you used to train the model.
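As a minimal sketch of "keep the software assets AND the data assets", one approach is to write a manifest recording the exact package versions and a checksum of every data file next to the trained model (the function and file names here are assumptions, not from the comment):

```python
import hashlib
import json
import subprocess
import sys


def sha256_of(path):
    """Compute the SHA-256 fingerprint of a file, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(data_files, out="manifest.json"):
    """Record pinned package versions and data checksums for later rebuilds."""
    manifest = {
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True,
        ).stdout.splitlines(),
        "data": {path: sha256_of(path) for path in data_files},
    }
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)
```

The checksums don't preserve the data themselves, but they let you verify later that the archived copies you trained on haven't drifted.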
There needs to be an immutable data store with high-performance reads and a 30+ year plan for survival if we're really going to retool our world around expert systems.
I think the only real trouble you'll run into is when architectures change: x86, ARM... you'd probably want to port your solution images then, if ever. There will always be folks emulating old hardware in software on new architectures.
You joke, but it's a major problem that our tech for very stable WORM media has lagged behind demand.
Our use of data has grown so much faster than our network capacity (and indeed, it seems like we're going to hit a series of physical laws and practical engineering constraints here). "Data has gravity," but the only way to sustainably hold a non-trivial volume of data for 20 years right now is to run a data center with a big DHT that detects faults and replicates data.
I would suggest familiarizing yourself with just the basics, because you don't have to go much further than that to make use of it for research purposes. My usual process involves breaking the work down into multiple stages (cleansing, conformance, testing, reporting), including a data dir, and finally creating a Dockerfile that simply adds the data/source into a simple hierarchy and includes all dependencies. As long as you know how to build a Dockerfile, you're golden. You can then upload the image to Docker Hub and have somebody else pull and run it to reproduce your entire environment. Helps a ton for online DS courses and MOOCs.
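A minimal Dockerfile along those lines might look like this (the base image, directory layout, and entrypoint script are illustrative assumptions, not from the comment):

```dockerfile
FROM python:3.6-slim

WORKDIR /project

# Pin dependencies so the environment can be rebuilt later.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Add the data and source into a simple hierarchy.
COPY data/ data/
COPY src/ src/
COPY notebooks/ notebooks/

# Run the full pipeline by default; override with `docker run ... bash`
# to poke around interactively.
CMD ["python", "src/run_pipeline.py"]
```

Building everything from a checked-in Dockerfile (rather than `docker exec`-ing changes into a running container) is what keeps the image itself reproducible.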
So we should sacrifice all the things that actually make a Data Scientist's work valuable in the name of fun and obscuring mistakes?
Fun I almost get, obviously good for productivity (though I think you'd really be sacrificing productive output for non-productive output), but I just don't get where you're even coming from with the "making mistakes more obvious" angle.
I don't understand why we couldn't have some system, perhaps using strace and friends, which tracks everything I ever do and how every file was created. Then I could just ask, "How did I make X?"
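Short of strace-level tracing, even a hand-rolled provenance log gets part of the way there. A hypothetical sketch (the log filename and record shape are made up for illustration):

```python
import json
import time

PROV_LOG = "provenance.jsonl"  # assumed append-only log location


def log_step(inputs, output, command):
    """Append one record: which inputs and command produced which output."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "inputs": inputs,
        "output": output,
        "command": command,
    }
    with open(PROV_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")


def how_did_i_make(target):
    """Answer 'how did I make X?' by scanning the log for the target file."""
    with open(PROV_LOG) as f:
        for line in f:
            record = json.loads(line)
            if record["output"] == target:
                return record
    return None
```

The catch, of course, is discipline: every step has to go through `log_step`, which is exactly what an strace-based tool would automate.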
The worst developers I've worked with take 2 weeks for tickets that should be simple. That would mean doing 1 easy ticket every day or two makes you a 10x developer....
I've always hated this term and the mindset around it. I think organizational practices, intelligent engineering strategy, etc are far more important to the output of a team than hiring one genius dev.
Did it ever occur to you that they might not be bad developers -- they're just goofing off because there's no consequence for being slow?
When my old workplace actually started measuring ticket closure times, our best developers turned out to be only 2x more productive than our worst ones. But suddenly a lot more tickets were getting closed.
I mean, I know that some complicated tasks needed the best developers, as the worst ones literally were incapable of understanding the code, but then again, doesn't that say something about the code itself and how poorly it communicates its intent? Perhaps clever code is simply confusing code...
Those organizational practices and strategy make the best developers better.
If you hire shitty/unqualified developers who cannot communicate, don't know the tools and aren't functional, even the most amazing developer is kneecapped from a productivity point of view because she must be accountable for everything, forever -- the idiots drag her down.
It's like anything else -- if you work at McDonald's, a bunch of slow unmotivated workers will slow down a fast/hard worker. It's just that the value of the labor + output for cheeseburgers is much lower than software!
But also, even this data is questionable in the extreme.
It may simply be that "10x" people who do exist do so in ways that are challenging to observe. As an example, not making difficult-to-detect mistakes early in the software lifecycle that cause major problems later (classic real world example: mongodb). Or that their influence on a software org causes overall productivity improvements.
In any case, it's a toxic myth that pits individuals against each other for demonstrations of productivity. I'm of the opinion it's a "self-defeating prophecy" or a good example of the "basilisk" effects in game theory.
The real 10x developers, in my experience, are the hardest to measure, because they lift the whole team by always being helpful and improving things where they see potential. That doesn't necessarily show up in their own results, but in the whole team's results.
Which is why metrics-driven organizations, in my experience, slow everything down: they disincentivize helping others.
Sure, from a "task" standpoint, but factor in quality, reusability, unique approaches, business sense, etc., and the best devs easily add 10x the value, if not more. Right?
Although I agree with the top post that reproducibility of results is important, I think software engineering principles are severely lacking among many data scientists. I attempted to deploy other people's models as a research assistant, and the lack of understanding of code style conventions was a big issue. Even now, when I go through some new ML system on GitHub, many of them have code style issues.
As an aside, in academia, to share other people's results you basically need to create a VirtualBox image to make them reproducible. I think Docker would work, but it may be too complicated.
Know how to actually write code, and also understand a broad range of modeling approaches and the math behind them.
The majority of people passing themselves off as data scientists in the traditional corporate world these days are at best unqualified and at worst outright frauds.
In my opinion, this article would make more sense if two things were first defined:
1) What is a data scientist?
2) What would it mean for someone defined in 1 above to be 10x more productive?
The article then says little or nothing about understanding the foundations of research, math, and computer science, instead going into superficial things like "understand the business" and code examples that could be produced by a beginner-level programmer.
This is not how to get to 10x; it's more like how to get to barely, possibly, competent.
The advice doesn't look bad but the click-bait title really inclined me to skip it, especially since it meant opening with a digression about whether 10x developers even exist rather than the actual content.
The whole time I was sarcastically thinking, "yeah I'm sure you get 10x by choosing consistent naming conventions. That will make up for the months of tearing your hair out trying to learn how ANNs work without a very solid understanding of math/stats."
90% of data scientists do not use neural networks. Of those who do, 90% shouldn't be, and are letting what's fun/interesting get in the way of actually producing value.
The fact of the matter is that if you're not FB/GOOG/AMAZ, the vast majority of what companies need from their data scientists actually requires very little advanced mathematics, and much more focus on rigor, reproducibility, and good deployment/engineering practices.
The main job of a DS is to have creative ideas about how to solve difficult problems. Creativity doesn't come on a schedule. Sometimes I have a rush of ideas that come all together - often because one idea unlocks lots of other ones. Sometimes I spend weeks or months reading papers and tinkering, with only <strike>failures</strike> lessons in how not to do it to show for it. The only thing that a "10x DS" indicates to me is that you have a lot of low-hanging fruit to pick.
Is writing docstrings with argument types a thing in Python? If so, wouldn't these developers benefit from using actual type annotations (or a language with static types)? This is one area where types actually help a great deal with rapid prototyping!
Also, I disagree with the Scala examples and the argument against brevity, but I guess this is the stuff of flamewars. Not only do I not find his more verbose examples any clearer, they also lack context: presumably the full snippet looks like this:
allClothesCount.sortBy(-_._2)
in which case any additional variable names don't help, and brevity makes the snippet clearer.
> Is writing docstrings with argument types a thing in Python? If so, wouldn't these developers benefit from using actual type annotations (or a language with static types)?
In theory yes (at least for Python 3), but it's a bit more nuanced. Python type annotations are still kind of... annoying? You have to import the things you need from `typing`, for example, and refer to classes as string literals if their definition comes later in the file. A lot of this is based on convention and what IDEs implement, since there's no fully fleshed-out standard. I think it's a good idea if done right -- I find myself more productive in TypeScript than JavaScript, so it is possible to add value by transplanting a type system onto a dynamic language -- but I don't think it's there yet.
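For contrast, a minimal sketch of the two styles under discussion (the function is hypothetical):

```python
from typing import List


# Docstring-only types: pure documentation; nothing verifies them.
def mean(xs):
    """Average a list of numbers.

    :param xs: List[float]
    :returns: float
    """
    return sum(xs) / len(xs)


# PEP 484 annotations: the same information, but mypy and IDEs can check it
# and flag callers that pass the wrong type before the code ever runs.
def mean_annotated(xs: List[float]) -> float:
    return sum(xs) / len(xs)
```

At runtime the two behave identically; the annotations only pay off once a checker or IDE reads them.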
As far as other languages, I think Julia has the best chance of eventually overtaking Python and it has a more deeply-embedded awareness of types.
I was thinking the same thing about those sortBy code snippets. I use shorthand for lambdas all the time because the individual item names are implied by the collection.
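In Python terms, the same trade-off looks like this (the data is made up to mirror the article's clothing example):

```python
clothes_counts = [("shirt", 5), ("sock", 12), ("hat", 3)]

# Terse: the collection's name already says what each element is,
# so a short lambda parameter costs no clarity.
by_count = sorted(clothes_counts, key=lambda kv: -kv[1])

# More verbose parameter names add little when the context is this small.
by_count_verbose = sorted(
    clothes_counts, key=lambda item_and_count: -item_and_count[1]
)
```

Both produce the same descending-by-count ordering; the question is just whether the longer name carries information the surrounding code doesn't already.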
- a marketplace for getting small doses of top-level expert advice
- more written about real-world, messy, data-science and machine learning implementations
- the vast majority of writing about ML/DS covers the individual elements. There is a lack of writing about how full systems integrate.
One big takeaway from the article is how easy Algorithmia makes it for data scientists and teams to productionize data science models, which is still a big challenge, with DevOps and data engineers scratching their heads over the model output from data scientists.