Customer Service 1-800-221-5528

Murach’s Python for Data Science (2nd Edition)

by Scott McCoy
15 chapters, 588 pages, 240 illustrations
Published May 2024
ISBN 978-1-943873-17-3

These days, data scientists and analysts are in high demand at organizations of all types. This book covers everything you need to hit the ground running with Python for data science and analysis.

To start, it presents a crash course in using the Pandas and Seaborn libraries for data analysis and visualization. Then, it presents a thorough course in data analysis, including how to use the Scikit-learn library to create statistical models that make predictions. Finally, it presents four case studies that tie these skills together and show how they’re used in the real world.

College Instructors

Go to our instructor’s site to learn more about this book and its instructor’s materials.

Who this book is for

This book is for anyone who wants to become a data scientist or data analyst. The only prerequisite is some programming experience in Python or any other programming language. That’s because chapter 1 presents the Python skills that you need for this book.

What this book does

To present the essential Python skills for data science in a manageable progression and at the right pace, this book is divided into 4 sections.

Section 1: Get off to a fast start

This section gets you started fast. First, you’ll learn how to use JupyterLab and Notebooks to organize and work with Python code. Then, you’ll learn how to use the Pandas and Seaborn libraries for data analysis and visualization. By the end of this section, you’ll be able to start doing analyses of your own.

Section 2: The critical skills for success on the job

This section presents the descriptive analysis skills that are critical for success on the job. That includes how to:

Get data from CSV files, Excel files, JSON files, Stata files, and databases
Clean data by dropping unneeded rows and columns and fixing missing values, data types, and outliers
Prepare data by adding columns, modifying the data in columns, and combining data frames
Analyze data by grouping and aggregating the data, using pivot tables, and more
Analyze time-series data by reindexing, downsampling, and working with rolling windows and running totals
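
As a rough illustration of the kinds of tasks in this list (this is not code from the book), a minimal Pandas sketch might look like the following. The inline CSV data and the column names (date, region, sales) are hypothetical and made up purely for this example.

```python
import io

import pandas as pd

# Hypothetical CSV data, invented for illustration (not from the book)
csv_text = """date,region,sales
2024-01-01,East,100
2024-01-02,East,
2024-01-03,West,150
2024-01-04,West,200
2024-01-04,West,200
"""

# Get the data: read CSV text into a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Clean the data: drop the duplicate row, fill the missing value,
# and convert the date strings to the datetime data type
df = df.drop_duplicates()
df["sales"] = df["sales"].fillna(0)
df["date"] = pd.to_datetime(df["date"])

# Prepare the data: add a column derived from an existing column
df["sales_k"] = df["sales"] / 1000

# Analyze the data: group by region and aggregate
totals = df.groupby("region")["sales"].sum()

# Analyze time-series data: a 2-row rolling mean and a running total
ts = df.set_index("date")["sales"]
rolling = ts.rolling(window=2).mean()
running = ts.cumsum()

print(totals)
print(running)
```

Filling the missing value with 0 is only one possible choice here. Depending on the analysis, dropping the row or filling with a mean may be more appropriate.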

Section 3: An introduction to predictive analysis

This section presents the predictive analysis skills that you need to create statistical models that make predictions. Although predictive analysis is a large topic that could be an entire book of its own, this section presents the concepts you need to get started with it. More specifically, it shows you how to use the Scikit-learn library to create linear regression models to predict numeric values.
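
To give a sense of what this involves, here is a minimal sketch of a Scikit-learn linear regression. It is not taken from the book, and the data is synthetic: the values are generated so that y is exactly 3x + 2, an assumption made only so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: y is exactly 3x + 2
X = np.arange(20).reshape(-1, 1)   # one independent variable
y = 3 * X.ravel() + 2              # the dependent variable

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Validate the model on the test data (R-squared), then make a prediction
r2 = model.score(X_test, y_test)
predicted = model.predict(np.array([[25]]))

print(model.coef_[0], model.intercept_, r2, predicted[0])
```

With real data the fit will not be perfect, so the R-squared score on the test set is what tells you how well the model generalizes.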

Section 4: The case studies

This section presents four complete analyses that show how the skills in this book can be applied to real-world datasets:

Polling data for the 2016 presidential election
Wildfire data from the US Forest Service
US social survey data
Basketball shot data from the NBA (National Basketball Association)

These in-depth analyses make sure that you master the professional skills that are in demand today from organizations of all types.

Why you’ll learn faster and better with this book

This book has features that are designed to make it as easy as possible for you to learn new skills faster and retain them better. Here are a few of those features:

All of the information is presented in paired pages, with the essential syntax, guidelines, and examples on the right page and clear explanations on the left page.
The hundreds of short examples present usable code for tasks that you are likely to need for your own analyses.
The exercises at the end of each chapter provide a way for you to gain valuable hands-on experience without any extra busywork.
The four analyses presented in section 4 use real-world datasets.
The paired-pages format is ideal for reference when you need to refresh your memory about how to do something.

What software you need for this book

The only software that’s needed for this book is the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more.

Appendixes A and B show how to download and install this distribution on both Windows and macOS systems. Then, chapter 1 shows how to get started with JupyterLab.

What people said about the first edition

“In his first at-bat, Scott McCoy smashes this one out of the park! This book is not just informative, it is exciting.”
—Scott Spurlock, Software Engineer, Georgia

“I really appreciated the four case studies. They…illustrated all phases of data analysis and visualization.”
—J. Jasperson – Texas A&M University

“Unlike some other books on data analysis with Python, the explanations of how to perform data analysis are thorough rather than terse or with no explanations.”
—Posted at an online bookseller

“This is a fun book for beginners and experienced data scientists.”
—Posted at an online bookseller

What people say about Murach books

“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”
—Bob L., Michigan

“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”
—Roxanne T., Student, Washington

“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”
—Charles Ferguson, Software Developer, Australia

“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”
—Hien Luu, SD Forum/Java SIG

“Throughout the entire project, your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”
—Alan Vogt, ETL Consultant, Massachusetts

“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”
—Posted at an online bookseller

“I picked up my first Murach book at a local bookstore in 2006, not knowing what was inside or what level of knowledge it would require of me, and it has changed my life since, literally. Your format (the paired pages) made it easy for me, an accountant with no IT or software development background, to understand databases and gain skills that proved useful throughout my entire career.”
—Giovanni Galope, Accountant, Philippines

On Murach’s Python Programming: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”
—Posted at an online bookseller

“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”
—Nolan Tamashiro, Developer

The perfect companion book

If you haven’t done much Python programming, we recommend Murach’s Python Programming as a companion to this book. It can help you raise your Python skills to a professional level, and it’s an ideal on-the-job reference.

Table of contents

Section 1 Get off to a fast start

Chapter 1 Introduction to Python for data science

Introduction to data science

What data science is

The five phases of data analysis and visualization

The IDEs for Python data science

The Python skills that you need for data science

How to install and import the Python modules for data science

How to call and chain methods

The coding basics for Python data science

How to use JupyterLab as your IDE

How to start JupyterLab and work with a Notebook

How to edit and run the cells in a Notebook

How to use the Tab completion and tooltip features

How syntax and runtime errors work

How to use Markdown language

How to get reference information

Two more skills for working with JupyterLab

How to split the screen between two Notebooks

How to use Magic Commands

Introduction to the case studies

The Polling case study

The Forest Fires case study

The Social Survey case study

The Sports Analytics case study

Chapter 2 The Pandas essentials for data analysis

Introduction to the Pandas DataFrame

The DataFrame structure

Two ways to get data into a DataFrame

How to save and restore a DataFrame

How to examine the data

How to display the data in a DataFrame

How to use the attributes of a DataFrame

How to use the info(), nunique(), and describe() methods

How to access the columns and rows

How to access columns

How to access rows

How to access a subset of rows and columns

Another way to access a subset of rows and columns

How to work with the data

How to sort the data

How to use the statistical methods

How to use Python for column arithmetic

How to modify the string data in columns

How to shape the data

How to use indexes

How to pivot the data

How to melt the data

How to analyze the data

How to group the data

How to aggregate the data

How to plot the data

Chapter 3 The Pandas essentials for data visualization

Introduction to data visualization

The Python libraries for data visualization

Long vs. wide data for data visualization

How the Pandas plot() method works by default

The three basic parameters for the Pandas plot() method

How to create 8 types of plots

How to create a line plot or an area plot

How to create a scatter plot

How to create a bar plot

How to create a histogram or a density plot

How to create a box plot or a pie plot

How to enhance a plot

How to improve the appearance of a plot

How to work with subplots

How to use chaining to get the plots you want

Chapter 4 The Seaborn essentials for data visualization

Introduction to Seaborn

The Seaborn methods for plotting

The general methods vs. the specific methods

How to use the basic Seaborn parameters

How to use the Seaborn parameters for working with subplots

How to enhance and save plots

How to set the title, x label, and y label

How to set the ticks, x limits, and y limits

How to set the background style

How to work with subplots

How to save a plot

How to create relational plots

How to create a line plot

How to create a scatter plot

How to create categorical plots

How to create a bar plot

How to create a box plot

How to create distribution plots

How to create a histogram

How to create a KDE or ECDF plot

How to enhance a distribution plot

Other techniques for enhancing a plot

How to use other Axes methods to enhance a plot

How to annotate a plot

How to set the color palette

How to enhance a plot that has subplots

How to customize the titles for subplots

How to set the size of a specific plot

Section 2 The critical skills for success on the job

Chapter 5 How to get the data

How to find the data that you want to analyze

Common data sources

How to find and select the data that you want

How to import data into a DataFrame

How to import data directly into a DataFrame

How to download a file to disk before importing it

How to work with a zip file on disk

How to get database data into a DataFrame

How to run queries against a database

How to use a SQL query to import data into a DataFrame

How to work with a Stata file

How to get and explore the metadata of a Stata file

How to build DataFrames for the metadata and the data

How to work with a JSON file

How to download a JSON file to disk

How to open a JSON file in JupyterLab

How to drill down into the data

How to build a DataFrame for the data

Chapter 6 How to clean the data

Introduction to data cleaning

A general plan for cleaning the data

What the info() method can tell you

What the unique values can tell you

What the value counts can tell you

How to simplify the data

How to drop rows based on conditions

How to drop duplicate rows

How to drop columns

How to rename columns

How to find and fix missing values

How to find missing values

How to drop rows with missing values

How to fill missing values

How to fix data type problems

How to find dates and numbers that are imported as objects

How to convert date and time strings to the datetime data type

How to convert object columns to numeric data types

How to work with the category data type

How to replace invalid values and convert a column’s data type

How to fix data problems when you import the data

How to find and fix outliers

How to find outliers

How to fix outliers

Chapter 7 How to prepare the data

How to add and modify columns

How to work with datetime columns

How to work with string columns

How to work with numeric columns

How to add a summary column to a DataFrame

How to apply functions and lambda expressions

How to apply functions to rows or columns

How to apply user-defined functions

How lambda expressions work with DataFrames

How to apply lambda expressions

How to work with indexes

How to set and remove an index

How to unstack indexed data

How to combine DataFrames

How to join DataFrames with an inner join

How to join DataFrames with a left or outer join

How to merge DataFrames

How to concatenate DataFrames

The SettingWithCopyWarning

What the warning is telling you

How to handle the warning

Chapter 8 How to analyze the data

How to create and plot long data

How to melt columns to create long data

How to plot melted columns

How to group and aggregate the data

How to group and apply a single aggregate method

How to work with a DataFrameGroupBy object

How to apply multiple aggregate methods

How to create and use pivot tables

How to use the pivot() method

How to use the pivot_table() method

How to work with bins

How to create bins of equal size

How to create bins with equal numbers of unique values

How to plot binned data

More skills for data analysis

How to select the rows with the largest values

How to calculate the percent change

How to rank rows

How to find other methods for analysis

Chapter 9 How to analyze time-series data

How to reindex time-series data

How to generate time periods

How to reindex with datetime indexes

How to reindex with a semi-month index

How a user-defined function can improve a datetime index

How reindexing with an improved index can improve plots

How to resample time-series data

How to use the resample() method

How to use the label and closed parameters when you downsample

How downsampling can improve plots

How to work with rolling windows

The concept of rolling windows

How to create rolling windows

How to plot rolling window data

How to work with running totals

How to create running totals

How to plot running totals

Section 3 An introduction to predictive analysis

Chapter 10 How to make predictions with a linear regression model

Introduction to predictive analysis

Types of predictive models

Introduction to regression analysis

How to find correlations between variables

The Housing dataset

How to identify correlations with a scatter plot

How to identify correlations with a grid of scatter plots

How to identify correlations with r-values

How to identify correlations with a heatmap

How to use Scikit-learn to work with a linear regression

A procedure for creating and using a regression model

The function and methods for linear regression models

How to create, validate, and use a linear regression model

How to plot the predicted data

How to plot the residuals

How to plot regression models with Seaborn

The lmplot() method and some of its parameters

How to plot a simple linear regression

How to plot a logistic regression

How to plot a polynomial regression

How to plot a lowess regression

How to use the residplot() method to plot the residuals

Chapter 11 How to make predictions with a multiple regression model

A simple regression model for a Cars dataset

The Cars dataset

How to create a simple regression model

How to plot the residuals of a simple regression

How to work with a multiple regression model

How to create a multiple regression model

How to plot the residuals of a multiple regression

How to work with categorical variables

How to identify categorical variables

How to review categorical variables

How to create dummy variables

How to rescale the data and check the correlations

How to create a multiple regression that includes dummy variables

How to improve a multiple regression model

How to select the independent variables

How to test different combinations of variables

How to use Scikit-learn to select the variables

How to select the right number of variables

Section 4 The case studies

Chapter 12 The Polling case study

Get and display the data

Import the modules that you will need

Get the data

Display the data

Clean the data

Examine the data

Drop columns and rows

Rename columns

Fix object columns

Fix data

Take an early plot with Pandas

Save the DataFrame

Prepare the data

Add columns for grouping and filtering

Create a new DataFrame in long form

Take an early plot of the long data with Seaborn

Add monthly bins to the DataFrame

Add an average percent column for each month

Save the wide and long DataFrames

Analyze the data

Plot the national and swing state polls

Plot the voter types

Plot the last two months of polling

Plot the gap changes in selected states

More preparation and analysis

Prepare the gap data for the last week of polling

Plot the gap data for the last week of polling

Prepare the weekly gap data for the swing states

Plot the weekly gap data for the swing states

Chapter 13 The Forest Fires case study

Get the data

Connect and query the database

Import the data into a DataFrame

Clean the data

Examine the data

Improve the readability of the data

Drop unnecessary rows

Drop duplicate rows

Convert dates to datetime objects

Check for missing contain dates

Prepare the data

Add fire_month and days_burning columns

Examine the contain_date and days_burning columns

Analyze the data

Analyze the data for California

Two more plots for California fires

Rank the states by total acres burned

Prepare a DataFrame for total acres burned by year within state

Prepare a DataFrame for the top 4 states

Plot the acres burned total by year for the top 4 states

Review the 20 largest fires in California

Use GeoPandas to plot the fires on a map

Use GeoPandas to plot the California map

Use GeoPandas or Seaborn to plot the California fires on a map

Plot the fires in the continental United States

Chapter 14 The Social Survey case study

Introduction to the Social Survey

Build a DataFrame for the metadata

The employment data

Use the codebook and read the data that you want

Prepare the data

Plot the data and reduce the number of categories

Plot the total counts of the responses

Convert the counts to percents and plot them

The work-life balance data

Search the codebook for small question sets

Read and review the work-life data

Plot the responses for the first question

Plot the responses for the second and third questions

How to expand the scope of the analysis

Use the codebook to find related columns

Use the codebook to find follow-up questions

Select the columns for an expanded DataFrame

Bin the data for a column

How to use a hypothesis to guide your analysis

Develop and test a first hypothesis

Develop and test a second hypothesis

Develop and test a third hypothesis

Chapter 15 The Sports Analytics case study

Get the data and build the DataFrame

Get the data

Build the DataFrame

Clean the data

Locate and drop unneeded rows

Locate and drop unneeded columns

Convert the game_date column to datetime data

Prepare the data

Add a column for the season

Add a column for the shot result

Add a column for points made for each shot

Add three summary columns

Plot the summary data

Plot the points per game by season

Plot the averages of shots, shots made, and points per game by season

Plot the shot locations

Plot the shot locations for two games

Plot the shot locations for two seasons

Plot the shot density for one season

Plot the shot density for two seasons

Appendices

Appendix A How to set up Windows for this book

How to download the files for this book

How to install Anaconda

How to use the Anaconda Navigator

How to create the murach environment

How to unzip some data and test your setup

How to use the Anaconda Prompt

Appendix B How to set up macOS for this book

How to download the files for this book

How to install Anaconda

How to use the Anaconda Navigator

How to create the murach environment

How to unzip some data and test your setup

How to use Terminal with an environment

The zip file for the book

This zip file includes the data files for the book as well as the JupyterLab Notebooks for:

The examples presented in chapters 1-11
The case studies presented in chapters 12-15
The starting points for the exercises at the end of each chapter
The solutions to those exercises

Sample PDFs

See for yourself how this book can get you started fast with Python for data science.

Appendix: How to set up your system for this book

This appendix shows how to set up your system so you’re ready to use Python for data science, including instructions for installing the required software and downloading the files for the book examples and exercises.

Chapter 2: The Pandas essentials for data analysis

This chapter shows how to use the Pandas library to get the data for an analysis into a DataFrame and to clean, prepare, analyze, and visualize that data. These are the essential Pandas skills that you’ll use for almost every analysis.

Book FAQs

On this page, we’ll be posting answers to the questions that come up most often about our Python Data Science book. So if you have any questions that you haven’t found answered here, please email us. Thanks!

Corrections

There are no book corrections that we know of at this time. But if you find any, please email us, and we’ll post any corrections that affect the technical accuracy of the book here. Thank you!

Our Ironclad Guarantee

You must be satisfied. Try our print books for 30 days or our eBooks for 14 days. If they aren't the best you've ever used, you can return the books or cancel the eBooks for a prompt refund. No questions asked!

Contact Murach Books

For orders and customer service:

1-800-221-5528

Weekdays, 8 to 4 Pacific Time

murachbooks@murach.com

College Instructors

If you're a college instructor who would like to consider a book for a course, please visit our website for instructors to learn how to get a complimentary review copy and the full set of instructional materials.