Welcome to the online book Introduction to Data Science. This book was created as a resource for asynchronous online learning during the current pandemic, when physical lectures are not possible and not all participants may be able to attend lectures, e.g., due to health issues or because they have to take care of a child.
However, this online book is more than just a short-term fix due to the pandemic. This book is an interactive guide for getting started with data science. All code that is used, e.g., to perform analyses or to create visualizations, is included and can be reused by anyone.
The book is provided both online and as a PDF for printing. There are also lecture slides, exercises, and sample solutions.
Target audience
The primary audience is students who attend my courses at the university. These students usually have (some) computer science background, but there are always interested students with little prior knowledge of computer science. The book is written such that anyone interested in data science can follow most of it, though maybe not every detail.
Prerequisites
In general, this book is designed to be relatively easy to follow for students at the university level who have some affinity for working with data and ideally also for programming and mathematics. However, without at least some knowledge about mathematics and computer science, a complete understanding of the materials will not be possible, even though you should be able to understand most concepts.
The following skills are required for a complete understanding of this book:
- Basic programming skills, ideally some Python knowledge. All contents of the book, with the exception of the programming examples, should be understandable without programming knowledge.
- Mathematical notation commonly used in higher mathematics. Without this knowledge, you may not understand how all models for data analysis work. Here are some examples of simple formulas that you should be able to read:
  - $\sum_{i=1}^n x_i$ (sum of the values $x_1, \ldots, x_n$)
  - $\prod_{i=1}^n x_i$ (product of the values $x_1, \ldots, x_n$)
  - $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ ($x$ is a $d$-dimensional real-valued vector)
- Stochastics, especially random variables and their distributions, e.g., the normal (Gaussian) distribution, uniform distribution, exponential distribution, and binomial distribution. Without this knowledge, you will have trouble understanding certain problems with data and models.
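As a quick self-check for the prerequisites listed above, the following minimal sketch writes the three formulas as Python code and draws one sample from each of the named distributions. It only uses the Python standard library. The values in `x`, the seed, and all distribution parameters are hypothetical and chosen purely for illustration.

```python
import random
from functools import reduce
from operator import mul

# Hypothetical values x_1, ..., x_n with n = 3.
x = [2.0, 3.0, 4.0]

total = sum(x)               # sum of the values x_1, ..., x_n
product = reduce(mul, x, 1)  # product of the values x_1, ..., x_n
d = len(x)                   # x is a d-dimensional real-valued vector

print(total, product, d)

# One sample from each distribution named above (parameters are assumptions).
rng = random.Random(42)                     # fixed seed for reproducibility
normal_sample = rng.gauss(0.0, 1.0)         # normal: mean 0, standard deviation 1
uniform_sample = rng.uniform(0.0, 1.0)      # uniform on the interval [0, 1]
exponential_sample = rng.expovariate(1.0)   # exponential with rate 1
# Binomial with n = 10 trials and success probability 0.5,
# simulated as the number of successes in 10 independent coin flips.
binomial_sample = sum(rng.random() < 0.5 for _ in range(10))

print(normal_sample, uniform_sample, exponential_sample, binomial_sample)
```

If you can predict what the first `print` shows and roughly what range each sample can fall into, you should be well prepared for the notation used in the book.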
Installation of Required Software
This online book is written using Jupyter Notebooks. You can download the Jupyter Notebook of each chapter using the download link at the top of the page. A PDF with the content of all chapters is also planned, but not yet available.
If you want to run the Jupyter Notebooks yourself, i.e., execute the code examples provided, you must install the required software:
- Python 3.6 or newer.
- Jupyter Notebook (tested with version 5.7.8, but it should also work with all newer and most older versions).
- The Python libraries numpy, scipy, scikit-learn (at least 0.22), pandas, matplotlib, seaborn, mlxtend, nltk, and wordcloud.
We recommend setting up a local environment as follows:
- Install Python
- Install Jupyter Notebook. We recommend installing it with pip, because pip ships with Python. More experienced users can also use conda.
- Install the required libraries with pip (`pip install numpy scipy scikit-learn pandas matplotlib mlxtend seaborn nltk wordcloud`). This may require sudo/admin rights. You can also install the libraries in your user space if you add `--user` to the command. The creation of the neural network figure in Chapter 7 also requires the package `graphviz`, which you have to install both with pip and within your operating system.
You should now be able to start your installation of Jupyter with the command `jupyter notebook`. Jupyter Notebook runs in the browser and initially shows the home folder of your local file system. You can navigate to the folder where you downloaded the notebooks and open them. You can also upload them to the currently open folder using the browser.
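To verify the installation, you can run a small check in a notebook cell or a Python shell. The following is a minimal sketch, not part of the book's official material: the helper `check_libraries` is a hypothetical name introduced here, and the sketch assumes that each library exposes a `__version__` attribute, falling back to "unknown" otherwise. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib


def check_libraries(module_names):
    """Return a dict that maps each module name to its version string.

    The value is None if the module cannot be imported, and "unknown"
    if the module does not expose a __version__ attribute.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Any failure during import means the library is not usable.
            versions[name] = None
        else:
            versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


# Import names of the libraries listed above (scikit-learn imports as sklearn).
REQUIRED = ["numpy", "scipy", "sklearn", "pandas", "matplotlib",
            "seaborn", "mlxtend", "nltk", "wordcloud"]

if __name__ == "__main__":
    for name, version in check_libraries(REQUIRED).items():
        status = version if version is not None else "NOT INSTALLED"
        print(f"{name}: {status}")
```

Any library reported as not installed can be installed with the pip command given above. The sketch does not compare version numbers, so you still need to check yourself that scikit-learn is at least version 0.22.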
Reporting Problems
If you find problems (e.g., spelling errors in chapters that are already proofread, blunders on my part like wrong formulas, or formatting problems on the website or in the PDF), please open an issue on GitHub.