What is it about?

The review gives an overview of the perspective that the text presents on what it terms "Data Science".

Why is it important?

The past several decades have seen large advances in systems for collecting data, for databases, for web-based data access, and for data analysis. New data sources, new computing tools, and new statistical theory (the last of which gets no attention in the text) place on data analysts the demand to marshal the available tools effectively.

Perspectives

Effective data analysis requires a mix of computing skills, an understanding of statistical theory as it relates to the task in hand, use of all available insights into the processes that generated the data, and careful and informed critique. Critical evaluative skills are required both for taking account of relevant work that others (including other team members) may have done, and for review of one's own work. This is the mix that was required for effective data analysis fifty years ago, and it is the mix that is required now. What has changed is the individual components of the mix, with the extent of change differing dramatically between areas of data analysis.

The introduction draws attention to the extent to which, in this century, "data are arriving by firehose and without design", often with very limited information content relative to the questions that may be of interest. Surely this makes careful and critical evaluation, calling on all available sources of insight, more important than ever if one is to do a really effective job. I am therefore concerned when the authors claim that while "statistical science overlaps with data science with respect to the objective of extracting information from data, the statistical aspects of the problem pale in comparison with the algorithmic and computational considerations." If they have in mind algorithmic and computational considerations that will assist in getting a really good handle on what the data may have to say, well, that is indeed a large component of what is required. Think tabular and graphical summary, model fitting, model diagnostics, the many uses of simulation, and graphical summary of results.

The text (Part I especially) is too focused on the matrix manipulations that underlie multiple regression. Yes, the linear algebra conversation, whether implemented using Python or R or other code, is an important conversation. But it needs to be more than a demonstration of the matrix operations required when there is no strong dependence between explanatory variables, and it needs a strong focus on the use of tools for model evaluation and criticism. This text never gets much further than the matrix algebra required for regression in a standard, fairly straightforward case, and pays inadequate attention to model diagnostics.
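
To make the point about dependence between explanatory variables concrete, here is a minimal Python sketch. It is not taken from the text under review; the simulated data, the function names `fit_ols` and `variance_inflation_factors`, and the choice of the variance inflation factor as the diagnostic are all assumptions made for illustration. It fits a regression by least squares and then checks how strongly each explanatory variable depends on the others.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept (hypothetical helper).

    Returns the coefficients, the residuals, and the condition
    number of the design matrix.
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    return beta, residuals, np.linalg.cond(A)


def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2), where R^2 comes from
    regressing that column on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, resid, _ = fit_ols(others, X[:, j])
        total = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs.append(total / np.sum(resid ** 2))
    return np.array(vifs)


# Simulated data (an assumption, purely for illustration).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)              # unrelated to x1
x2_dep = x1 + 0.01 * rng.normal(size=n)    # nearly a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_indep = np.column_stack([x1, x2_indep])
X_dep = np.column_stack([x1, x2_dep])

beta_indep, resid_indep, cond_indep = fit_ols(X_indep, y)
beta_dep, resid_dep, cond_dep = fit_ols(X_dep, y)
vif_indep = variance_inflation_factors(X_indep)
vif_dep = variance_inflation_factors(X_dep)

print("VIF, independent columns:", vif_indep)
print("VIF, dependent columns:  ", vif_dep)
print("Condition numbers:", cond_indep, cond_dep)
```

The matrix operations run without complaint in both cases. Only the diagnostics show that the coefficients in the second fit are poorly determined, which is why the matrix algebra alone is not enough.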

John Maindonald
Statistics Research Associates Limited

Read the Original

This page is a summary of: Algorithms for Data Science, by Brian Steele, John Chandler and Swarna Reddy. Springer, 2016, xxiii + 430 pages, £49.99/$66.99, hardcover, ISBN: 978-3-319-45795-6. International Statistical Review, August 2017, Wiley,
DOI: 10.1111/insr.12224.