Who is a Data Scientist?
Konstantin Golyaev, BSc in Economics 2004 (HSE), MAE 2006 (NES), Ph.D. 2011 (University of Minnesota), explains how this interdisciplinary field relates to a graduate degree in economics
A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician.
Josh Wills
Director of Data Engineering @ Slack
Since there are almost no formal university degrees in data science, people tend to come into this field from all walks of life: computer science, physics, statistics, economics, political science, and so on. As far as economists are concerned, it is much easier to transition to data science for applied econometricians than for, e.g., decision theorists, although anything is possible if one is willing to work hard for it.
What is data science, really? The best way to answer this is by considering what data scientists do. When I interview data science candidates, I look for three sets of skills. Their relative importance varies based on the project particulars, but we generally seek people who are strong in at least one of them and deficient in none.
First, a data scientist needs to be able to code. The holy trinity of programming languages for data science is R, Python, and SQL. A successful candidate should have some knowledge of SQL, be comfortable using either R or Python, and be at least conversant in the other. Most of the time you would be expected to obtain data for analysis yourself, and companies tend to store their data in databases or distributed computing clusters, such as Hadoop. A candidate who can only code in Stata or Matlab would depend on teammates to supply them with data, and that puts them at a serious disadvantage. The ability to write decent code tends to be strongly positively correlated with an individual's productivity: people who are bad at coding generally struggle to get their work done on time.
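To make the SQL-plus-Python workflow concrete, here is a minimal sketch; the orders table and its columns are made up for illustration, and an in-memory SQLite database stands in for the company's real data store:

import sqlite3
import pandas as pd

# In practice the connection would point at a production database or a
# Hadoop/Hive endpoint; an in-memory SQLite database keeps the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_total REAL);
    INSERT INTO orders VALUES (1, 25.0), (1, 40.0), (2, 15.5);
""")

# The analyst writes the SQL; pandas handles the transfer into Python.
query = """
    SELECT customer_id, COUNT(*) AS n_orders, SUM(order_total) AS revenue
    FROM orders
    GROUP BY customer_id
"""
df = pd.read_sql_query(query, conn)
print(df)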
Second, a data scientist must be able to perform statistical modeling. This is usually the dimension where economists, and particularly econometricians, tend to shine. The core competency of a scientist is to understand the limits of applicability of models. This can be illustrated with a simple example of comparing averages between two samples, which is typically done using a t-test. Like any model, the t-test relies on a number of assumptions, some of which are vastly more important than others. For example, the assumption that both samples come from a normal distribution (which never actually holds true in practice) can be ignored as long as you have enough data, and we almost always do these days. On the other hand, the assumption that the two samples are independent is critical for interpreting the difference in means causally, and when this assumption is violated, the entire analysis is usually not salvageable. It is the job of a scientist to be able to guide the team on which assumptions of the models are non-negotiable.
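As a quick illustration of that point (the distributions and sample sizes below are arbitrary), a t-test on two large samples drawn from a heavily skewed distribution still behaves sensibly, because with this much data the normality assumption is effectively irrelevant:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two large samples from a clearly non-normal (exponential) distribution,
# with a small true difference in means.
a = rng.exponential(scale=1.00, size=50_000)
b = rng.exponential(scale=1.02, size=50_000)

# Welch's t-test: with samples this large, the lack of normality does not
# matter, but the two samples must still be independent of each other.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")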
Finally, data scientists must develop business acumen. Translating the questions that matter to the business into the language of statistical modeling, and translating answers from models back into ‘business-speak’, is generally much harder than it sounds. A good example is the problem of predicting customer churn, i.e., whether a given customer will stop doing business with the company. This is usually modeled by predicting the churn probability for every customer, which can be done using logistic regression, frequently called ‘logit’ in econometrics. The logit model works by selecting parameters that produce a distribution of predicted probabilities that is as close as possible to the observed outcomes. From a business standpoint, however, a model that predicts the entire distribution reasonably well is likely not very useful. In contrast, a model that predicts the top quantiles of the distribution really well, but does a poor job everywhere else, can be much more useful. At the end of the day, the business cares about identifying the subset of customers who are most likely to quit so that they can be contacted and convinced to stay, so particular attention will be paid to customers with very high predicted churn probabilities.
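As a sketch of that workflow (the features, data, and cutoff below are made up for illustration), one could fit a logit model and then hand the business only the top decile of customers ranked by predicted churn probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical features: months since last purchase, number of support tickets.
X = np.column_stack([rng.exponential(3.0, n), rng.poisson(1.0, n)])
# Synthetic churn outcomes whose probability rises with both features.
p_true = 1.0 / (1.0 + np.exp(-(-3.0 + 0.4 * X[:, 0] + 0.6 * X[:, 1])))
y = rng.binomial(1, p_true)

model = LogisticRegression(max_iter=1000).fit(X, y)
churn_prob = model.predict_proba(X)[:, 1]

# The business-facing output: the 10% of customers most likely to churn,
# i.e., the ones worth contacting with a retention offer.
top_decile = np.argsort(churn_prob)[::-1][: n // 10]
print(f"Average predicted churn probability in top decile: "
      f"{churn_prob[top_decile].mean():.2f}")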
Most successful data scientists I know tend to have Ph.D.s. It is rarely a prerequisite, and I suspect it is largely an artifact of people transitioning into data science from other fields of study. It is likely easier to get hired with a Ph.D., since obtaining one requires completing an independent research project, which is a good predictor that you can do so again.
There are two fundamental differences between data science and applied econometrics. First, many econometricians have substandard coding habits, a luxury data scientists cannot afford. In academia, there is little to no reward for developing high-quality, reproducible code, so researchers respond to incentives and write code that few others can use or understand. For people who enjoy writing good, clean code that can be reused later, the returns on that time investment in academia are virtually non-existent, whereas such skills are highly regarded in data science.
Second, many data science questions tend to focus on predictive modeling, whereas econometrics has a rich set of tools for causal inference. The question of whether predictive modeling is more important than a clean identification strategy is well beyond the scope of this note. While the techniques are frequently very similar between the two applications, the focus can be quite different, and it can take some time to get used to how data scientists approach problems. The strength of economists is their ability to handle problems for which no correct answer is available, e.g., measuring demand elasticity or estimating the impact of interventions such as promotions.
Finally, I will outline my personal experience in data science with a Ph.D. in Economics. I was quite fortunate to be hired by Amazon as its first full-time economist. My field was applied econometrics, so I had some experience writing code in Stata and Matlab and working with reasonably large datasets. It became apparent very quickly that I'd have to pick up SQL, R, and Python to be able to interact with non-economists at Amazon, and I did so over the course of my first couple of years.
In my 4.5-year tenure at Amazon, I worked on a bewildering variety of problems. Most notably, I developed and oversaw the deployment of an econometric model that defines the rules by which the Amazon marketplace operates: any time you purchase anything on Amazon, my model decides which merchant gets your business by default. In addition, I implemented a number of ad hoc forecasting solutions for internal customers at Amazon, including hardware infrastructure, HR headcount planning, used-textbook pricing, and a few others. Amazon can be a pretty intense place to work, and my general advice to anyone wondering whether it is the right place for them is to ask yourself how much of a tech geek you really are. Working at a cutting-edge tech company means you deal with a lot of software that is not quite mature; if getting error messages from programs frustrates you, you might not enjoy working for Amazon.
For the last year or so, I have been working at Microsoft in Azure Machine Learning. I work on internal revenue forecasting problems for various divisions of Microsoft finance. I also develop a toolbox of forecasting methods that allows users to combine traditional time series models, such as ARIMA, with modern machine-learning methods, such as gradient-boosted regression trees and random forests. The toolbox is written entirely in R, and perhaps one day we will decide to open-source it so that anyone can use it as they see fit.
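As a rough sketch of the kind of hybrid such a toolbox enables (in Python rather than R, with synthetic data and an arbitrary specification; this is not the toolbox's actual implementation), one common scheme is to fit ARIMA first and then let gradient-boosted trees model whatever structure remains in the residuals:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Synthetic monthly revenue series: trend plus seasonality plus noise.
n = 120
t = np.arange(n)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, n)
series = pd.Series(y, index=pd.date_range("2010-01-01", periods=n, freq="MS"))

train, horizon = series[:-12], 12

# Step 1: a classical ARIMA fit on the training window.
arima = ARIMA(train, order=(1, 1, 1)).fit()
arima_forecast = arima.forecast(steps=horizon)

# Step 2: gradient-boosted trees learn the in-sample residuals from simple
# calendar features (here, just the month), then correct the ARIMA forecast.
month = train.index.month.to_numpy().reshape(-1, 1)
gbm = GradientBoostingRegressor().fit(month, arima.resid)
future_month = arima_forecast.index.month.to_numpy().reshape(-1, 1)
hybrid_forecast = arima_forecast + gbm.predict(future_month)

print(hybrid_forecast.head())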
This year Kostya co-authored a book that his colleagues refer to as "one of the most useful books of the year" - A Gentle Introduction to Effective Computing in Quantitative Research: What Every Research Assistant Should Know (MIT Press) - available on Amazon.