Some skills, such as knowledge of a programming language, are needed throughout the data science process. The most popular languages are Python and R. In some disciplines, both are needed in order to access libraries or packages created for a specific task.
DATA SCIENCE PROCESS | SKILLS / KNOWLEDGE |
---|---|
1. Framing the problem | Domain knowledge |
2. Data collection | Database management (MySQL, PostgreSQL, MongoDB); distributed processing (Apache Hadoop, Spark, Flink); web scraping and using APIs |
3. Data cleaning | pandas for Python; R |
4. Exploratory analysis | Statistics; data visualization (Python libraries: NumPy, Matplotlib, pandas, SciPy; R packages: ggplot2, dplyr) |
5. Modeling and analysis | Statistical inference; machine learning (scikit-learn for Python) |
6. Interpretation and communication of results | Domain knowledge; data visualization (Matplotlib, ggplot2, seaborn, Tableau, D3.js); dashboards (Shiny for R, Dash for Python); sharing and documenting code (Jupyter notebooks, R Markdown, creating R or Python packages) |
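
To make steps 3 to 5 of the table concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recommended workflow: the data set (hours studied versus exam score) is made up, the function names `load_and_clean`, `summarize`, and `fit_line` are hypothetical, and only the standard library is used so that the example stays self-contained. In practice the cleaning step would more likely use pandas and the modeling step scikit-learn, as listed in the table.

```python
import csv
import io
import statistics

# Hypothetical raw data: a tiny CSV with one missing value and one duplicate row.
RAW_CSV = """hours,score
1,52
2,55
2,55
3,
4,64
5,67
"""


def load_and_clean(text):
    """Step 3 (data cleaning): drop incomplete rows and exact duplicates."""
    rows = []
    seen = set()
    for record in csv.DictReader(io.StringIO(text)):
        if not record["hours"] or not record["score"]:
            continue  # skip rows with a missing value
        pair = (float(record["hours"]), float(record["score"]))
        if pair in seen:
            continue  # skip exact duplicates
        seen.add(pair)
        rows.append(pair)
    return rows


def summarize(rows):
    """Step 4 (exploratory analysis): basic descriptive statistics of the scores."""
    scores = [score for _, score in rows]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
    }


def fit_line(rows):
    """Step 5 (modeling): least-squares fit of score = intercept + slope * hours."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_mean = statistics.mean(xs)
    y_mean = statistics.mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


rows = load_and_clean(RAW_CSV)
summary = summarize(rows)
slope, intercept = fit_line(rows)

print(summary)
print(f"fitted line: score = {intercept:.2f} + {slope:.2f} * hours")
```

Each function maps to one row of the table, which keeps the stages separate and easy to swap out, for example replacing `load_and_clean` with a pandas-based version once the data no longer fits in a short string.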