Roger D. Peng

Johns Hopkins Bloomberg School of Public Health


The field of data science has grown significantly in recent years, attracting excitement and interest from many different directions. The demand for introductory educational materials has grown along with the field itself, leading to a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important contribution to this fast-growing literature, but given the wide availability of materials, a reader should be inclined to ask, “What is the unique contribution of this book?” To answer that question, it is useful to step back for a moment and consider the development of the field of data science over the past few years.

When thinking about data science, it is important to consider two questions: “What is data science?” and “How should one do data science?” The former question is under active discussion among a broad community of researchers and practitioners, and there does not appear to be much consensus to date. However, there seems to be a general understanding that data science focuses on the more “active” elements of answering questions with data: data wrangling, cleaning, and analysis. These elements are often highly problem-specific and may seem difficult to generalize across applications. Nevertheless, over time some core elements have emerged that recur as useful concepts across different problems. Given the lack of clear agreement over the definition of data science, there is a strong need for a book like this one to propose a vision for what the field is and what the implications are for the activities in which members of the field engage.

The first important concept addressed by this book is tidy data, which is a format for tabular data formally introduced to the statistical community in a 2014 paper by Hadley Wickham. Although originally popularized within the R programming language community via the Tidyverse package collection, the tidy data format is a language-independent concept that facilitates the application of powerful generalized data cleaning and wrangling tools. The second key concept is the development of workflows for reproducible and auditable data analyses. Modern data analyses have grown in complexity due to the availability of data and the ease with which we can implement complex data analysis procedures. Furthermore, these data analyses are often part of decision-making processes that may have significant impacts on people and communities. Therefore, there is a critical need to build reproducible analyses that can be studied and repeated by others in a reliable manner. Third, statistical methods represent an important element of data science, both for building prediction and classification models and for making inferences about unobserved populations. Finally, because a field can succeed only if it fosters an active and collaborative community, it has become clear that being fluent in the tools of collaboration is a core element of data science.

This book takes these core concepts and focuses on how one can apply them to do data science in a rigorous manner. Students who learn from this book will be well versed in the techniques and principles behind producing reliable evidence from data. This book is centered on the implementation of the tidy data framework within the Python programming language, and as such employs the most recent advances in data analysis coding. The use of Jupyter notebooks for exercises immediately places the student in an environment that encourages auditability and reproducibility of analyses. The integration of git and GitHub into the course is an important tool for teaching about collaboration and community, two concepts that are critical to data science.

The demand for training in data science continues to increase. The availability of large quantities of data to answer a variety of questions, the computational power available to many more people than ever before, and the public awareness of the importance of data for decision-making have all contributed to the need for high-quality data science work. This book offers a sophisticated first introduction to the field of data science, with a balanced mix of practical skills and generalizable principles. As we continue to introduce students to data science and train them to confront an expanding array of data science problems, they will be well served by the ideas presented here.