Many blogs and podcasts discuss the question of how to get into data science. In my experience, data scientists mainly come from computer science, physics or statistics. Social scientists are rare among data scientists – but I believe that in many business contexts, social scientists and psychologists can provide a much-needed perspective on data analysis. While psychological theories, methods, and statistics provide a good starting point for data science adventures, most social scientists will need to learn and embrace additional skills. In this post, I highlight what I think they need to add to their training if they aim to pursue a data-driven career in what fancy people now call “data science”.

The term “data science” is still used broadly and can mean anything from “putting data into Excel” to “using t-tests” or “developing a deep neural network for machine vision”. I use the term primarily to mean the whole process of collecting data, preparing and blending data using technical and statistical tools, analysing data (either for inference or prediction) and visualising the results in an understandable and accessible way. This definition is still broad, but it shows the different skill sets needed.^{1}

A single data scientist cannot be an expert in all of these domains, so a data science team needs to be diverse in skills and backgrounds. Psychologists and social scientists can bring valuable understanding of the data to the table, especially if the data to be analysed comes from humans (e.g. behaviour, survey responses). Psychological knowledge can help to identify and operationalise variables relevant to the business question. Furthermore, their statistical knowledge – especially in the domain of psychometrics, e.g. structural equation models – helps to make sense of the data collected. Making predictions using black-box neural networks is only one part of a data science solution. In many real-world cases, business clients also want to understand how they need to adapt to their customers’ behaviour and needs. Thus, you need to make sense of the data, the inferences and the predictions.

Machine learning and AI are just one set of tools in a data scientist’s toolbox. While computationally efficient and sometimes superior in terms of out-of-sample prediction, machine learning algorithms have a background in statistics and in many cases, the statistical tools are better suited to provide sensible information. Nevertheless, the first question in the analysis step is to find the right tool for the question at hand. Sometimes, it is a neural network to predict customer segments and, sometimes, it is a structural equation model to investigate relationships between survey responses. The approaches are not mutually exclusive and can benefit by learning from each other.^{2}

## What to learn?

So, if you are a social scientist and like to analyse data, what should you learn to become a data scientist?

### Learn R (or Python)

R is the most versatile statistical software. There are a ton of different, freely available packages to conduct nearly any analysis. If you are proficient in R, it will be quite easy to learn Python if required (and the same holds the other way round). Or put differently: learn at least one programming, scripting, or statistical language.

#### Resources

- DataCamp offers a wide range of courses for R and Python.
- If you are more into books, *R for Data Science* might be the best starting point.

### Deepen your statistics skills

Most psychology courses will require you to learn basic statistical tools. But the *p*-value might not mean what you think it does. So you should read up on statistical foundations and statistical modelling. I like to see basic knowledge of Bayesian statistics, but hierarchical modelling and generalized linear models in a maximum likelihood framework are also relevant. This will allow you to quickly understand other approaches and relate machine learning techniques back to the statistical foundations. For applied work, there is rarely a need to go back to the mathematics.

#### Resources

- For a focus on Bayesian-oriented statistical modelling, have a look at *Statistical Rethinking*^{3} and *Bayesian Data Analysis (3rd ed.)*.
- In order to have a basic understanding of decision trees and neural networks, have a look at Andrew Ng's *Machine Learning* course on Coursera.
- The most elaborate but math-heavy book is probably *Elements of Statistical Learning* by Hastie et al.

### Understand data & Learn SQL

For many students, data means having participants in rows and variables in columns. But this is only one way to represent data. Even if each variable has its own row, it is still data – just another representation. While tidy data is generally a good rule for storing and processing data, there are instances where other representations are more helpful for performing tasks efficiently.
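To make the idea of "same data, different representation" concrete, here is a minimal sketch in plain Python. The participants, variable names and the helper functions `wide_to_long` and `long_to_wide` are made up for illustration; in practice you would more likely use a reshaping function from R or a data-frame library.

```python
# Hypothetical survey data in "wide" format:
# one row per participant, one column per variable.
wide = [
    {"participant": 1, "age": 34, "satisfaction": 4},
    {"participant": 2, "age": 27, "satisfaction": 5},
]


def wide_to_long(rows, id_key):
    """Reshape to "long" format: one row per participant-variable pair."""
    return [
        {id_key: row[id_key], "variable": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != id_key
    ]


def long_to_wide(rows, id_key):
    """Reverse the reshaping; no information is lost in either direction."""
    by_id = {}
    for row in rows:
        record = by_id.setdefault(row[id_key], {id_key: row[id_key]})
        record[row["variable"]] = row["value"]
    return list(by_id.values())


long_rows = wide_to_long(wide, "participant")
for row in long_rows:
    print(row)
```

Both structures hold exactly the same information. Which one is more convenient depends on the task, for example the wide form for a regression and the long form for grouped summaries or plotting.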

SQL and relational databases might not be the tools you actually use, but understanding JOIN, UNION and the power of relational databases helps a lot to understand data representations and how data can be accessed efficiently. (By the way, having SPSS data files is *not* efficient.)
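As a rough illustration of what JOIN and UNION do, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The tables (`customers`, `survey_2023`, `survey_2024`) and their contents are invented for this example. It uses `UNION ALL`, which stacks rows and keeps duplicates; plain `UNION` would remove duplicate rows.

```python
import sqlite3

# In-memory database with hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE survey_2023 (customer_id INTEGER, score INTEGER);
    CREATE TABLE survey_2024 (customer_id INTEGER, score INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO survey_2023 VALUES (1, 4), (2, 3);
    INSERT INTO survey_2024 VALUES (1, 5);
    """
)

# UNION ALL stacks the rows of two tables that share the same columns;
# JOIN then attaches the customer name to every survey response.
query = """
    SELECT c.name, s.year, s.score
    FROM (
        SELECT customer_id, 2023 AS year, score FROM survey_2023
        UNION ALL
        SELECT customer_id, 2024 AS year, score FROM survey_2024
    ) AS s
    JOIN customers AS c ON c.id = s.customer_id
    ORDER BY c.name, s.year
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The point of the design is that each piece of information is stored only once (names in `customers`, responses in the survey tables) and combined on demand, instead of being copied into one big flat file.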

#### Resources

- An interactive introduction to SQL (which I haven’t read before) is available at SQLBolt.

### Understand business challenges

In applied data science, you will need to answer business questions. Those are not answered by a statistical procedure, but by interpreting the results of your analysis. You need to understand what your client wants to know and which data is relevant to this question. This is the overarching environment in which you will conduct your analysis. You will be able to learn this by actually working or doing internships and by continuously thinking and asking colleagues about it. If you are a junior, make sure to be open-minded and learn from your colleagues' experience.

This last point is especially crucial if you are not working in a corporate department. Many data science agencies will offer you some algorithm or machine learning solution, but do not provide an elaborate answer in their client’s language. In my experience, this has driven many companies away from actually implementing data-driven decision making. Understanding the business context will also help you to better understand the challenge and thereby optimise your analysis towards the purpose of your work.

If you have relevant skills and resources to add, feel free to add them in the comments!

1. I still do not think “data science” qualifies as being a “science” in the strict sense of the word. It is more a field of engineering than a field of science.
2. Yarkoni & Westfall (2017) give a great perspective on how psychology can advance as a science if it adopts machine learning paradigms, most importantly the focus on prediction.
3. I cannot recommend this book and the accompanying video lectures enough!