Customer Service 1-800-221-5528

Murach’s Python for Data Analysis

by Scott McCoy
15 chapters, 559 pages, 235 illustrations
Published August 2021
ISBN 978-1-943872-76-3

Today, data analysts are in demand in all types of fields, with Python as a preferred language. With this book, you can gain the Python data analysis skills you need to broaden your career opportunities, more quickly and easily than you ever thought possible, using the proven Murach approach. What’s more, after you’ve used this book to master those skills, it will become your all-time favorite on-the-job reference.

College Instructors

Go to our instructor’s site to learn more about this book and its instructor’s materials.

What this book does

To make this book work as effectively as possible for you, the content is divided into 4 sections.

Section 1: Get off to a fast start

Section 1 consists of 4 chapters that get you started with data analysis as quickly and effectively as possible.

You’ll learn how to use JupyterLab and Jupyter Notebooks to organize and develop your analyses. You’ll learn how to use a subset of the Pandas module for data analysis and visualization. And you’ll learn how to use a subset of the Seaborn module to create professional data visualizations that can be used for presentations.

When you’re done with this section, you’ll be able to start doing analyses of your own.

Section 2: The critical skills for success on the job

Most analysis is descriptive analysis in which you analyze past data to help you gain new insights. That’s why section 2 of this book presents the critical descriptive analysis skills that you need for success on the job. That includes:

How to read data into a Pandas DataFrame
How to clean the data by dropping unneeded rows and columns and fixing missing values, data types, and outliers
How to prepare the data by adding columns, modifying the data in columns, and combining DataFrames
How to analyze the data by grouping and aggregating the data, using pivot tables, and more
How to analyze time-series data by reindexing, downsampling, and working with rolling windows and running totals
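As a rough illustration of the skills listed above, here is a minimal sketch that reads, cleans, prepares, and analyzes a tiny dataset with Pandas. The CSV text, the column names (date, region, sales, notes), and the variable names are all made up for this example and are not taken from the book.

```python
import io
import pandas as pd

# Hypothetical CSV data, inlined so the sketch is self-contained.
csv_text = """date,region,sales,notes
2021-01-01,East,100,a
2021-01-02,East,,b
2021-01-03,West,300,c
2021-01-04,West,250,d
2021-01-05,East,150,e
2021-01-06,West,200,f
"""

# Read the data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Clean: drop an unneeded column, fill a missing value, fix a data type.
df = df.drop(columns=["notes"])
df["sales"] = df["sales"].fillna(0)
df["date"] = pd.to_datetime(df["date"])

# Prepare: add a column derived from an existing one.
df["sales_k"] = df["sales"] / 1000

# Analyze: group and aggregate, then build a pivot table.
totals = df.groupby("region")["sales"].sum()
pivot = df.pivot_table(index="date", columns="region", values="sales")

# Time series: downsample to 2-day totals, then compute a 3-day
# rolling mean and a running total.
daily = df.set_index("date")["sales"]
two_day = daily.resample("2D").sum()
rolling = daily.rolling(window=3).mean()
running = daily.cumsum()

print(totals)
print(two_day)
```

Filling the missing value with 0 is just one choice made for this sketch. Depending on the data, dropping the row or filling with a mean might be more appropriate.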

Section 3: An introduction to predictive analysis

Predictive analysis takes data analysis to another level by using statistical models to predict unknown or future values. Although a complete treatment of predictive analysis is far beyond the scope of this book, all analysts should know the basic concepts and skills. That’s why section 3 of this book presents those concepts and gets you started doing your own predictions.

This introduction includes how to find the correlations between variables, how to use Scikit-learn to work with linear regression models, and how to use Seaborn to create and plot linear regression models. It also shows you how to select the right variables and the right number of variables for multiple regressions...one of the critical skills for doing an effective job of making predictions.
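To give a feel for what that looks like in code, here is a minimal sketch that finds a correlation with Pandas and fits a simple linear regression with Scikit-learn. The data is synthetic and the column names (sqft, price) are assumptions made for this illustration. It isn't the dataset or the exact procedure used in the book.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): price depends
# roughly linearly on square footage, plus random noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, size=200)
price = 50_000 + 120 * sqft + rng.normal(0, 10_000, size=200)
df = pd.DataFrame({"sqft": sqft, "price": price})

# Find the correlation (r-value) between the two variables.
r = df["sqft"].corr(df["price"])

# Split the data, fit a linear regression, and score it on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    df[["sqft"]], df["price"], test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)
r_squared = model.score(X_test, y_test)

# Use the model to predict an unknown value.
predicted = model.predict(pd.DataFrame({"sqft": [1500]}))
print(f"r = {r:.2f}, R^2 = {r_squared:.2f}, predicted = {predicted[0]:,.0f}")
```

Scoring the model on data that it wasn't trained on is what tells you whether its predictions are likely to hold up for new values.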

Section 4: The case studies

This section presents 4 case studies that show you how the skills you’ve been learning can be applied to real-world datasets:

the polling data for the 2016 presidential election
the US Forest Service data for forest fires
the US social survey data taken from hundreds of polls
the basketball shot location data for NBA player Stephen Curry

Frankly, you can’t master on-the-job skills by working with toy datasets, and these case studies help make sure that you will master the professional skills that you need.

Who this book is for

This book is for anyone who wants to become a data analyst, no matter what the field. The only prerequisite is some programming experience, although it doesn’t have to be in Python.

That’s because chapter 1 presents the minimal set of Python skills that you need for this book: how to import modules; how to call and chain methods; how to code lists, slices, tuples, and dictionaries; and how to continue statements over two lines.
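To show how small that set of skills is, here is a minimal sketch that touches each one. The values and variable names are made up for this example, and it uses a standard library module in place of the data analysis modules so it runs anywhere.

```python
# Import a module (standard library here, so the sketch runs anywhere).
import statistics

# A list, a slice of it, a tuple, and a dictionary (all values are made up).
temps = [61.5, 64.0, 58.2, 70.1, 66.3]
first_three = temps[:3]                  # slice: the items at index 0, 1, 2
point = (37.77, -122.42)                 # tuple: fixed-size and immutable
city = {"name": "  san francisco ", "temps": temps}

# Call and chain methods: each method returns a value the next one acts on.
clean_name = city["name"].strip().title()

# Continue a statement over two lines by wrapping it in parentheses.
summary = (f"{clean_name}: mean {statistics.mean(temps):.1f}, "
           f"max {max(temps)}")
print(summary)
```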

Of course, the more programming experience you have, the faster you’ll move through this book. In fact, our unique presentation methods let you set your own pace. If you have relatively little experience, you can move more slowly and do the exercises at the ends of the chapters. If you have a lot of experience, you can move quickly and apply your new skills on the job right away.

Why you’ll learn faster and better with this book

Like all our books, this one has features that you won’t find in competing books and that ease the learning curve for you. Here are a few of them.

When you page through this book, you’ll see that all of the information is presented in what we call "paired pages", with the essential syntax, guidelines, and examples on the right page and the perspective and extra explanation on the left page.
This helps you learn faster by reading less...and this is the ideal reference format when you need to refresh your memory about how to do something.
To keep the focus on professional practices, the examples in this book are taken from eight analyses of real-world datasets: the four case studies that are presented in section 4 as well as four other analyses. We believe that studying analyses like these is critical to the learning process, and yet you won’t find anything like them in any competing books or online courses.
The short examples taken from these analyses also help you solve specific analysis problems you’re facing. And our paired-pages format makes it easier to find the example that you’re looking for than it is with traditional books that embed the examples in the text.
Like all our books, this one has exercises at the end of each chapter that give you hands-on experience. These exercises also encourage you to experiment and to apply what you’ve learned in new ways…just as you’ll have to do on the job. And because our exercises start from Notebooks that provide the starting code for an analysis, you’ll get more practice in less time.

The perfect companion book

If you haven’t done that much Python programming before you read this book, we would like to recommend the perfect companion book: Murach’s Python Programming. It will help you raise your Python skills to a professional level, and it too is a terrific on-the-job reference.

What software you need for this book

To do data analysis with Python as shown in this book, you just need to download and install the Anaconda distribution of Python. It includes JupyterLab, Pandas, Seaborn, Scikit-learn, and more. To help you install it, appendixes A and B present the procedures you need for both Windows and macOS systems. Then, chapter 1 shows you how to get started with JupyterLab.

What people say about Murach books

“This is my first exposure to Murach’s books, and I love them. I like the organization of the content, the consistent approach in each book, and the accuracy of the material.”
—Bob L., Michigan

“I really like the paired-pages format of detailed information on the left and quick notes on the right. This helps me to quickly find the information I’m looking for.”
—Roxanne T., Student, Washington

“I can’t praise this book highly enough. The clarity used in picking what to include, when to introduce it, and how to do so is remarkable.”
—Charles Ferguson, Software Developer, Australia

“Another thing I like is the exercises at the end of each chapter. They’re a great way to reinforce the main points of each chapter and force you to get your hands dirty.”
—Hien Luu, SD Forum/Java SIG

“Throughout the entire project, your book was indispensable to me. The answers were right there at every turn. All the examples made sense, and they all worked!”
—Alan Vogt, ETL Consultant, Massachusetts

“This book covers the perfect amount of description, and it does not make you bored by providing unnecessary details.”
—Posted at an online bookseller

“I picked up my first Murach book at a local bookstore in 2006, not knowing what was inside or what level of knowledge it would require of me, and it has changed my life since, literally. Your format (the paired pages) made it easy for me, an accountant with no IT or software development background, to understand databases and gain skills that proved useful throughout my entire career.”
—Giovanni Galope, Accountant, Philippines

On Murach’s Python Programming: “This is now my third book for Python, and it is the ONLY one that has made me feel comfortable solving problems and reading code. The paired pages approach is fantastic, and it makes learning the syntax, rules, and conventions understandable for me.”
—Posted at an online bookseller

“Your books shine out from the rest—the quality of writing and presentation of information is topnotch, and the consistency of quality across books is impressive.”
—Nolan Tamashiro, Developer

View the table of contents for this book in a PDF: Table of Contents (PDF)

Section 1 Get off to a fast start

Chapter 1 Introduction to Python for data analysis

Introduction to data analysis

What data analysis is

The five phases of data analysis and visualization

The IDEs for Python data analysis

The Python skills that you need for data analysis

How to install and import the Python modules for data analysis

How to call and chain methods

The coding basics for Python data analysis

How to use JupyterLab as your IDE

How to start JupyterLab and work with a Notebook

How to edit and run the cells in a Notebook

How to use the Tab completion and tooltip features

How syntax and runtime errors work

How to use Markdown language

How to get reference information

Two more skills for working with JupyterLab

How to split the screen between two Notebooks

How to use Magic Commands

Introduction to the case studies

The Polling case study

The Forest Fires case study

The Social Survey case study

The Sports Analytics case study

Chapter 2 The Pandas essentials for data analysis

Introduction to the Pandas DataFrame

The DataFrame structure

Two ways to get data into a DataFrame

How to save and restore a DataFrame

How to examine the data

How to display the data in a DataFrame

How to use the attributes of a DataFrame

How to use the info(), nunique(), and describe() methods

How to access the columns and rows

How to access columns

How to access rows

How to access a subset of rows and columns

Another way to access a subset of rows and columns

How to work with the data

How to sort the data

How to use the statistical methods

How to use Python for column arithmetic

How to modify the string data in columns

How to shape the data

How to use indexes

How to pivot the data

How to melt the data

How to analyze the data

How to group the data

How to aggregate the data

How to plot the data

Chapter 3 The Pandas essentials for data visualization

Introduction to data visualization

The Python libraries for data visualization

Long vs. wide data for data visualization

How the Pandas plot() method works by default

The three basic parameters for the Pandas plot() method

How to create 8 types of plots

How to create a line plot or an area plot

How to create a scatter plot

How to create a bar plot

How to create a histogram or a density plot

How to create a box plot or a pie plot

How to enhance a plot

How to improve the appearance of a plot

How to work with subplots

How to use chaining to get the plots you want

Chapter 4 The Seaborn essentials for data visualization

Introduction to Seaborn

The Seaborn methods for plotting

The general methods vs. the specific methods

How to use the basic Seaborn parameters

How to use the Seaborn parameters for working with subplots

How to enhance and save plots

How to set the title, x label, and y label

How to set the ticks, x limits, and y limits

How to set the background style

How to work with subplots

How to save a plot

How to create relational plots

How to create a line plot

How to create a scatter plot

How to create categorical plots

How to create a bar plot

How to create a box plot

How to create distribution plots

How to create a histogram

How to create a KDE or ECDF plot

How to enhance a distribution plot

Other techniques for enhancing a plot

How to use other Axes methods to enhance a plot

How to annotate a plot

How to set the color palette

How to enhance a plot that has subplots

How to customize the titles for subplots

How to set the size of a specific plot

Section 2 The critical skills for success on the job

Chapter 5 How to get the data

How to find the data that you want to analyze

Common data sources

How to find and select the data that you want

How to import data into a DataFrame

How to import data directly into a DataFrame

How to download a file to disk before importing it

How to work with a zip file on disk

How to get database data into a DataFrame

How to run queries against a database

How to use a SQL query to import data into a DataFrame

How to work with a Stata file

How to get and explore the metadata of a Stata file

How to build DataFrames for the metadata and the data

How to work with a JSON file

How to download a JSON file to disk

How to open a JSON file in JupyterLab

How to drill down into the data

How to build a DataFrame for the data

Chapter 6 How to clean the data

Introduction to data cleaning

A general plan for cleaning the data

What the info() method can tell you

What the unique values can tell you

What the value counts can tell you

How to simplify the data

How to drop rows based on conditions

How to drop duplicate rows

How to drop columns

How to rename columns

How to find and fix missing values

How to find missing values

How to drop rows with missing values

How to fill missing values

How to fix data type problems

How to find dates and numbers that are imported as objects

How to convert date and time strings to the datetime data type

How to convert object columns to numeric data types

How to work with the category data type

How to replace invalid values and convert a column’s data type

How to fix data problems when you import the data

How to find and fix outliers

How to find outliers

How to fix outliers

Chapter 7 How to prepare the data

How to add and modify columns

How to work with datetime columns

How to work with string columns

How to work with numeric columns

How to add a summary column to a DataFrame

How to apply functions and lambda expressions

How to apply functions to rows or columns

How to apply user-defined functions

How lambda expressions work with DataFrames

How to apply lambda expressions

How to work with indexes

How to set and remove an index

How to unstack indexed data

How to combine DataFrames

How to join DataFrames with an inner join

How to join DataFrames with a left or outer join

How to merge DataFrames

How to concatenate DataFrames

How to handle the SettingWithCopyWarning

What the warning is telling you

What to do when the warning is displayed

What to watch for when the warning isn’t displayed

Chapter 8 How to analyze the data

How to create and plot long data

How to melt columns to create long data

How to plot melted columns

How to group and aggregate the data

How to group and apply a single aggregate method

How to work with a DataFrameGroupBy object

How to apply multiple aggregate methods

How to create and use pivot tables

How to use the pivot() method

How to use the pivot_table() method

How to work with bins

How to create bins of equal size

How to create bins with equal numbers of values

How to plot binned data

More skills for data analysis

How to select the rows with the largest values

How to calculate the percent change

How to rank rows

How to find other methods for analysis

Chapter 9 How to analyze time-series data

How to reindex time-series data

How to generate time periods

How to reindex with datetime indexes

How to reindex with a semi-month index

How a user-defined function can improve a datetime index

How reindexing with an improved index can improve plots

How to resample time-series data

How to use the resample() method

How to use the label and closed parameters when you downsample

How downsampling can improve plots

How to work with rolling windows

The concept of rolling windows

How to create rolling windows

How to plot rolling window data

How to work with running totals

How to create running totals

How to plot running totals

Section 3 An introduction to predictive analysis

Chapter 10 How to make predictions with a linear regression model

Introduction to predictive analysis

Types of predictive models

Introduction to regression analysis

How to find correlations between variables

The Housing dataset

How to identify correlations with a scatter plot

How to identify correlations with a grid of scatter plots

How to identify correlations with r-values

How to identify correlations with a heatmap

How to use Scikit-learn to work with a linear regression

A procedure for creating and using a regression model

The function and methods for linear regression models

How to create, validate, and use a linear regression model

How to plot the predicted data

How to plot the residuals

How to plot regression models with Seaborn

The lmplot() method and some of its parameters

How to plot a simple linear regression

How to plot a logistic regression

How to plot a polynomial regression

How to plot a lowess regression

How to use the residplot() method to plot the residuals

Chapter 11 How to make predictions with a multiple regression model

A simple regression model for a Cars dataset

The Cars dataset

How to create a simple regression model

How to plot the residuals of a simple regression

How to work with a multiple regression model

How to create a multiple regression model

How to plot the residuals of a multiple regression

How to work with categorical variables

How to identify categorical variables

How to review categorical variables

How to create dummy variables

How to rescale the data and check the correlations

How to create a multiple regression that includes dummy variables

How to improve a multiple regression model

How to select the independent variables

How to test different combinations of variables

How to use Scikit-learn to select the variables

How to select the right number of variables

Section 4 The case studies

Chapter 12 The Polling case study

Get and display the data

Import the modules that you will need

Get the data

Display the data

Clean the data

Examine the data

Drop columns and rows

Rename columns

Fix object types

Fix data

Take an early plot with Pandas

Save the DataFrame

Prepare the data

Add columns for grouping and filtering

Create a new DataFrame in long form

Take an early plot of the long data with Seaborn

Add monthly bins to the DataFrame

Add an average percent column for each month

Save the wide and long DataFrames

Analyze the data

Plot the national and swing state polls

Plot the voter types

Plot the last two months of polling

Plot the gap changes in selected states

More preparation and analysis

Prepare the gap data for the last week of polling

Plot the gap data for the last week of polling

Prepare the weekly gap data for the swing states

Plot the weekly gap data for the swing states

Chapter 13 The Forest Fires case study

Get the data

Download and unzip the SQLite database

Connect and query the database

Import the data into a DataFrame

Clean the data

Examine the data

Improve the readability of the data

Drop unnecessary rows

Drop duplicate rows

Convert dates to datetime objects

Check for missing contain dates

Prepare the data

Add fire_month and days_burning columns

Examine the contain_date and days_burning columns

Analyze the data

Analyze the data for California

Two more plots for California fires

Rank the states by total acres burned

Prepare a DataFrame for total acres burned by year within state

Prepare a DataFrame for the top 4 states

Plot the acres burned total by year for the top 4 states

Review the 20 largest fires in California

Use GeoPandas to plot the fires on a map

Use GeoPandas to plot the California map

Use GeoPandas or Seaborn to plot the California fires on a map

Plot the fires in the continental United States

Chapter 14 The Social Survey case study

Introduction to the Social Survey

Download and unzip the zip file for the data

Build a DataFrame for the metadata

The employment data

Use the codebook and read the data that you want

Prepare the data

Plot the data and reduce the number of categories

Plot the total counts of the responses

Convert the counts to percents and plot them

The work-life balance data

Search the codebook for small question sets

Read and review the work-life data

Plot the responses for the first question

Plot the responses for the second and third questions

How to expand the scope of the analysis

Use the codebook to find related columns

Use the codebook to find follow-up questions

Select the columns for an expanded DataFrame

Bin the data for a column

How to use a hypothesis to guide your analysis

Develop and test a first hypothesis

Develop and test a second hypothesis

Develop and test a third hypothesis

Chapter 15 The Sports Analytics case study

Get the data and build the DataFrame

Get the data

Build the DataFrame

Clean the data

Locate and drop unneeded rows

Locate and drop unneeded columns

Convert the game_date column to datetime data

Prepare the data

Add a column for the season

Add a column for the shot result

Add a column for points made for each shot

Add three summary columns

Plot the summary data

Plot the points per game by season

Plot the averages of shots, shots made, and points per game by season

Plot the shot locations

Plot the shot locations for two games

Plot the shot locations for two seasons

Plot the shot density for one season

Plot the shot density for two seasons

Appendixes

Appendix A How to set up Windows for this book

How to install and use Anaconda

How to install Anaconda

How to use the Anaconda Prompt

How to use the Anaconda Navigator

How to install and use the files for this book

How to install the files for this book

How to make sure Anaconda is installed correctly

How to download the large data files for this book

Appendix B How to set up macOS for this book

How to install and use Anaconda

How to install Anaconda

How to run conda commands

How to use the Anaconda Navigator

How to install and use the files for this book

How to install the files for this book

How to make sure Anaconda is installed correctly

How to download the large data files for this book

Free chapter

To get a better idea of how well this book can work for you regardless of your level of experience, you can download the first chapter of this book in PDF format.

Chapter 1: Introduction to Python for data analysis

This chapter gives you an overview of what data analysis entails and what you’ll learn in the rest of the book. So after a quick review of the required Python skills, you’ll learn how to use JupyterLab for developing data analyses, and you’ll be introduced to the 4 case studies that are used throughout the book.

Chapter 1 PDF Download Now

Book analyses and exercises

This download includes:

The JupyterLab Notebooks for the data analysis examples and case studies in the book
The starting JupyterLab Notebooks for the exercises at the end of each chapter
The solutions to those exercises

Appendix A for Windows and appendix B for macOS show how to install and use these files.

Zip file Download Now

December 2023 update

In December 2023, we updated the download for this book to work with the latest versions of Pandas, Seaborn, and scikit-learn. If you want to use this download, it's available here:

Updated zip file Download Now

The code in the updated download doesn't always match the code presented in the book, but the changes are summarized here:

PDF Summary of updates Download Now

To view the "Frequently Asked Questions" for this book in a PDF, just click on this link: View the questions

Then, if you have any questions that aren't answered here, please email us. Thanks!

To view the corrections for this book in a PDF, just click on this link: View the corrections

Then, if you find any other errors, please email us so we can correct them in the next printing of the book. Thank you!

Our Ironclad Guarantee

You must be satisfied. Try our print books for 30 days or our eBooks for 14 days. If they aren't the best you've ever used, you can return the books or cancel the eBooks for a prompt refund. No questions asked!

Contact Murach Books

For orders and customer service:

1-800-221-5528

Weekdays, 8 to 4 Pacific Time

murachbooks@murach.com

College Instructors

If you're a college instructor who would like to consider a book for a course, please visit our website for instructors to learn how to get a complimentary review copy and the full set of instructional materials.