ts-python

Series III

Sessions

Seminar I: “A Data-Cleaning Deep-Dive with pandas”

“The following tale of alien encounters is true. And by true, I mean false. It’s all lies. But they’re entertaining lies. And, in the end, isn’t that the real truth? The answer is: no.”

— ‘The Springfield Files’ (S08E10)

Materials

Abstract

How do we handle inconsistent, missing, or poorly formatted data in our analyses? How do we ensure that our analyses are accurate if our data may have errors in it?

Do you…

… deal with poorly formatted Excel sheets?
… query and combine data from multiple sources?
… want to spend less time cleaning and more time modeling?
… ensure your data is less noisy?

Then join us for a session on efficient data cleaning in Python!

In this episode we’ll take a deep dive into a single dataset to better understand how we can effectively clean tabular data in pandas. We’ll discuss what clean data is and why untidy data can have disastrous effects, while also exploring how we can best leverage pandas to tidy our data.

We’ll cover topics like wrangling strings, renaming, parsing datatypes, sorting/reindexing, reshaping, and binning to ensure your data is ready to be used by your models. Additionally, we’ll provide strategies you can use to verify the cleanliness of your data.
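As a rough illustration of a few of these steps (renaming, string wrangling, datatype parsing, and binning), here is a minimal sketch. The column names, values, and bin edges are hypothetical and are not taken from the seminar's dataset.

```python
import pandas as pd

# Hypothetical messy input: padded headers, inconsistent casing,
# numbers stored as text, and a placeholder for a missing value.
raw = pd.DataFrame({
    " Name ": ["  alice", "BOB ", "Carol"],
    "Age (yrs)": ["34", "n/a", "58"],
})

clean = (
    raw
    # Renaming: normalize the headers to short lowercase names.
    .rename(columns=lambda c: c.strip().lower().replace(" (yrs)", ""))
    .assign(
        # String wrangling: trim whitespace and unify capitalization.
        name=lambda d: d["name"].str.strip().str.title(),
        # Datatype parsing: unparseable entries become NaN.
        age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
    )
    # Binning: bucket ages into two illustrative groups.
    .assign(
        age_group=lambda d: pd.cut(
            d["age"], bins=[0, 40, 120], labels=["under_40", "40_plus"]
        )
    )
)

# A simple cleanliness check: exactly one age failed to parse.
assert clean["age"].isna().sum() == 1
```

Ending the pipeline with an explicit assertion is one way to verify cleanliness: the check fails loudly if a later data delivery breaks an assumption.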

Keywords:

Topic: Deep dive into a hands-on data cleaning problem
Content:
- A session where the instructor walks through a data cleaning problem from end to end.
- Topics discussed include: combining data from multiple sources and formats, utilities to fill missing data, etc.

Seminar II: “Interactive Visualizations in Python”

“My eyes! The goggles do nothing!”

— ‘Radioactive Man’ (S07E02)

Materials

Abstract

How do you produce interactive visualizations using Plotly or Altair? How can these visualizations drive further insights or communicate powerful messages?

Do you…

… want to better explore your large datasets?
… need to easily share your data with others?
… rapidly iterate on data visualizations?

Then join us for a session on plotting without matplotlib!

Matplotlib is a fantastic tool for publication-quality visualizations and for communicating complex conclusions about static data. But what if we need an interactive visualization—one which we can explore in various ways and which we can dynamically update?

In this episode, we’ll take a look at JavaScript-based plotting tools such as Plotly and Altair for interactive visualizations and compare them to what we’re able to do in Matplotlib. We’ll see how these tools can give further insights into our data and allow us to create complex, interactive analyses that can communicate sophisticated conclusions. We’ll seek answers to questions like: where do these new tools improve upon Matplotlib’s API? We’ll discuss how these various tools differ from each other, and we’ll show examples of using Altair and Plotly for effective data exploration and communication.

Keywords: plotting, visualization, matplotlib, plotly, altair, graph, charts

Seminar III: “Tabular Data Persistence in Python”

“I used to be with ‘it,’ but then changed what ‘it’ was. Now what I’m with isn’t ‘it’ anymore, and what’s ‘it’ seems weird and scary.”

— ‘Homerpalooza’ (S07E24)

Materials

Abstract

How do you persist tabular data; how do you store it to disk? Of all the common formats—.csv, .npy, Parquet, Feather, HDF, etc.—how do you know which to choose and what trade-offs they each present?

Do you…

… need to rapidly (and conveniently) store and retrieve columnar data?
… want to understand more about data persistence?
… want to figure out how to choose between so many different new technologies?

Then come join us for a session on columnar file and database formats!

How do you store tabular data between analyses? How do you store important metadata along with the tabular data? What do you do about parts of your data that may require runtime support (e.g., complex indices or the use of user-defined types)? Why and when is pickle not enough, or even a bad idea?

In this episode, we’ll discuss persisting tabular data for storage to disk or transmission to other users (e.g., to and from API endpoints or to and from nodes in a distributed computation). We’ll discuss the difference between text-based formats (e.g., csv, txt, fwf, tst, dat, json), binary formats (e.g., pickle, npy, parquet, feather), and database formats (e.g., hdf, sql). We’ll look to cover which of these is most convenient and performant for persisting data, as well as the contexts in which these file formats should be used or avoided.
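One trade-off between text-based and binary formats is type fidelity. The sketch below uses only the standard library, with a single hypothetical record, to show that a CSV round trip returns strings while a pickle round trip preserves the original Python types. It is an illustration only, not a recommendation of either format.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical record with a date and a float.
rows = [{"day": date(2024, 1, 2), "price": 101.5}]

# Text-based round trip: write to an in-memory CSV, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "price"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

# Binary round trip: pickle keeps the Python objects intact.
from_pickle = pickle.loads(pickle.dumps(rows))

print(from_csv[0])     # every value comes back as a string
print(from_pickle[0])  # date and float survive unchanged
```

The type fidelity of pickle comes at a cost: unpickling data from an untrusted source can execute arbitrary code, which is one reason it can be a bad idea for data exchange.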

Keywords: parquet, csv, persistence, feather, tabular, data

Seminar IV: “Debugging in Python”

“The doctor said I wouldn’t have so many nosebleeds if I kept my finger out of there.”

— ‘I Love Lisa’ (S04E15)

Materials

Abstract

How do you find bugs deep in your code? How can you use common tools in Python to ease this process? And, once you find a bug, what can you do to surgically investigate and fix it?

Do you…

… ever struggle to track down bugs in your Python code?
… always use print when debugging?
… wish you knew more about debugging options in Python?

Then join us for a session on debugging Python code!

Dynamic languages like Python allow you to rapidly prototype code without a compiler getting in your way. However, this means that bugs can sometimes slip into our code if we’re not careful.

In this episode, we’ll discuss the various ways you can debug your Python programs. We’ll cover techniques for instrumenting live code to track down bugs in large programs, how to make the best use of tools such as pdb and ipdb, and common mechanisms within the sys module.
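As one example of a sys-module mechanism, the sketch below uses `sys.settrace` to record every call to a function of interest without editing that function. The functions `scale` and `pipeline` are hypothetical stand-ins for code under investigation.

```python
import sys

calls = []
WATCHED = {"scale"}  # names of the functions we want to observe


def tracer(frame, event, arg):
    # Invoked by the interpreter whenever a new Python frame is entered.
    if event == "call" and frame.f_code.co_name in WATCHED:
        calls.append((frame.f_code.co_name, dict(frame.f_locals)))
    return None  # no per-line tracing inside the frame


def scale(x, factor):  # hypothetical function under investigation
    return x * factor


def pipeline(values):
    return [scale(v, 2) for v in values]


previous = sys.gettrace()
sys.settrace(tracer)
try:
    result = pipeline([1, 2])
finally:
    sys.settrace(previous)  # always restore the earlier trace function

print(calls)
```

For interactive inspection, placing a call to the built-in `breakpoint()` at the suspicious line drops into pdb by default.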

Keywords: debugging, pdb, Python debugger, ipdb, breakpoint

Seminar V: “Concurrency and Parallelism in Python”

“Kids, you tried your best, and you failed miserably. The lesson is: never try.”

— ‘Burns’ Heir’ (S05E18)

Materials

Abstract

How do you make use of concurrency mechanisms in Python? How do you decide which of the various approaches to take, and how can these various approaches help you improve your code?

Do you…

… know when to use multithreading vs multiprocessing in Python?
… want to understand when the global interpreter lock will slow you down?
… need to get more done in less time?

Then join us for a session on concurrency and parallelism in Python.

The Global Interpreter Lock (“GIL”) is a specter that haunts concurrency approaches in Python. It eliminates threading as a simple concurrency primitive and forces us to develop a much more sophisticated view of concurrency and parallelism.

In this episode, we will discuss concurrency approaches in Python, touching upon standard library modules such as threading, multiprocessing, multiprocessing.shared_memory, and concurrent.futures. We’ll discuss the motivation behind using these tools, their limitations, their relationship to the Global Interpreter Lock (“GIL”), existing workarounds to avoid the GIL, and other related approaches for getting more work done in less overall time. (Note: this session will not cover asyncio.)
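A minimal sketch of the `concurrent.futures` interface, assuming an I/O-bound workload: `time.sleep` stands in for a blocking network or disk call, and the hypothetical `fetch` function is not from the seminar materials. Blocking calls like this release the GIL, so threads can overlap the waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(delay):
    # Stand-in for blocking I/O; the GIL is released while sleeping.
    time.sleep(delay)
    return delay


delays = [0.05, 0.05, 0.05, 0.05]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the order of the inputs.
    results = list(pool.map(fetch, delays))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

With four workers the waits overlap, so the elapsed time is typically much closer to one delay than to the sum of all four. For CPU-bound pure-Python work the GIL prevents this kind of speed-up with threads; `ProcessPoolExecutor` offers the same interface using processes instead.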

Keywords: threading, multiprocessing, concurrency, parallelism, GIL, Global Interpreter Lock

Seminar VI: “Memory profiling in Python (and pandas)”

“But I predict that within 100 years, computers will be twice as powerful, 10000× larger, and so expensive that only the five richest kings of Europe will own them.”

— ‘Much Apu About Nothing’ (S07E23)

Materials

Abstract

How do you profile memory usage in your Python code? How do you find the spots in your code where you are allocating too much memory? And, critically, what do you *do about it?*

Do you…

… want to predict how much memory your Python code will require?
… need to identify areas in your code that hog memory?
… wish you could use pandas in a more memory efficient fashion?

Then come join us for a session on memory profiling in Python (and pandas)!

Python does not provide fine-grained control over memory allocation, deallocation, or layout. Tools like pandas often encourage making copies, going so far as to remove keyword arguments like inplace=… from their APIs. Though it may be the case for many of our analyses that we can simply buy a bigger computer, there are still circumstances under which we may need to control, predict, and reduce our memory usage.

In this episode, we will cover memory profilers in Python (memray) and how to effectively use them to locate, investigate, and address memory-intensive code. We’ll develop intuition around memory use, discuss how Python allocates and deallocates memory, and discuss approaches we can take in code using pandas to address memory issues and to improve the memory efficiency of our analyses.
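The session names memray as its profiler. As a dependency-free stand-in, the sketch below uses the standard library's `tracemalloc` instead to compare the peak memory of two ways of holding the same integers. The helper `peak_bytes` and the element count are illustrative assumptions.

```python
import tracemalloc
from array import array


def peak_bytes(build):
    """Return the peak traced allocation, in bytes, while build() runs."""
    tracemalloc.start()
    try:
        data = build()  # keep the result alive until the peak is read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del data
    return peak


N = 100_000

# A list stores pointers to individual Python int objects.
list_peak = peak_bytes(lambda: list(range(N)))

# An array stores raw 64-bit integers in one contiguous buffer.
array_peak = peak_bytes(lambda: array("q", range(N)))

print(f"list: {list_peak:,} bytes, array: {array_peak:,} bytes")
```

The same reasoning carries over to pandas, where a column of Python objects generally needs more memory than the same values stored in a fixed-width numeric dtype.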

Keywords: profiling, memory, pandas, Python

Seminar VII: “Dashboarding In Python”

KRUSTY THE CLOWN: “Let’s just say it moved me… TO A BIGGER HOUSE! Oops! I said the quiet part loud and the loud part quiet.”

— “A Star is Burns” (S06E18)

Materials

Abstract

How do you create an interactive dashboard in Python? How do you deploy and share that dashboard? And, critically, can we do this without being a front-end developer?

Do you…

… need to visually explore your data?
… need to enable others to rapidly understand one or more data sets?
… need to automatically update your visualizations on a recurring or streaming basis?

Then come join us for a session on dashboarding in Python!

Python is a premier programming language for scientific computation, whether it’s used for data exploration, visualization, or reporting. While Python has always supported interactive visualizations via matplotlib, its web-based dashboarding capabilities have been limited to rendering static plots onto the web.

In this episode, we will provide an overview of some Python-based dashboarding tools like panel, bokeh, and holoviews. These tools provide a convenient mechanism to create complex dashboards, enabling you to better understand your data and even share that understanding with your colleagues. We’ll develop an understanding of the uses of a data dashboard, the core features of these Python tools, and how these frameworks are implemented, so that you can work with them effectively without needing to be a front-end developer.

Keywords: dashboard, data-viz, Python

Seminar VIII: “All About Generators”

LYLE LANLEY: “So, then, ‘mono’ means ‘one,’ and ‘rail’ means ‘rail.’ And that concludes our intensive three week course.”

— “Marge vs the Monorail” (S04E12)

Materials

Abstract

Do generators have any part in API design? Are generators just some advanced Python novelty feature? And, importantly, how do I actually use generators in my day-to-day work?

Do you…

… need to work with streaming data?
… need to flexibly compose complex processing pipelines?
… need to exert control over memory consumption in your Python programs?

Then come join us for a session on generators!

Generators are considered an advanced Python feature and are often viewed only as lazily evaluated iterators. While this is mechanically true, they have a much broader application in program structure and can be used to simplify step-wise processing pipelines.

In this episode, we will move beyond the superficial mechanics of generators in Python and explore their implementation and use-cases. We will develop a better sense for how generators can be leveraged in your everyday scripting as a structuring mechanism to reduce your code churn and produce less error-prone results.
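To make the idea of generators as a structuring mechanism concrete, here is a small sketch of a step-wise pipeline. The stage names and the sample input are hypothetical. Each stage is a generator that consumes the previous one, so items flow through one at a time.

```python
def stripped(lines):
    # Stage 1: remove surrounding whitespace from each line.
    for line in lines:
        yield line.strip()


def non_empty(lines):
    # Stage 2: drop lines that are empty after stripping.
    for line in lines:
        if line:
            yield line


def as_ints(lines):
    # Stage 3: parse each remaining line as an integer.
    for line in lines:
        yield int(line)


raw = ["1\n", "\n", " 2 \n", "3\n"]  # stand-in for a file or a stream

# Composing the stages does no work yet; nothing runs until iteration.
pipeline = as_ints(non_empty(stripped(raw)))
total = sum(pipeline)
print(total)
```

Because each stage holds only one item at a time, the same pipeline works unchanged on a large file or an unbounded stream, and stages can be added, removed, or reordered independently.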

Keywords

generators, advanced, Python, memory management, lazy evaluation

Seminar IX: The When & Why of Object Orientation

BART SIMPSON: Todd, you and Data are Team Strike Force. Nelson, that just leaves you and Martin. MARTIN PRINCE, JR: Team Discovery Channel!

— “Lemon of Troy” (S06E24)

Materials

Abstract

Do I need object orientation for encapsulation in Python? Can useful APIs consist entirely of functions? Should metaclasses ever be used? And, most importantly, when do I actually need to write a class?

Do you…

… need to create APIs for your colleagues to use?
… need to understand modules that have minimal documentation?
… want to develop a keener intuition on when to write a class vs a function?

Then come join us for a session on object orientation!

Object orientation is useful because of its unique features like inheritance and composition, right? As it turns out, that’s only part of the story. The usefulness of object orientation in Python does not lie in a mechanical understanding of its features, but in its application to decomposing real-world problems.

In this episode, we will cover when and why object orientation should be used in Python code and how it can be used appropriately to simplify your life. Additionally, we will discuss common pitfalls of implementing objects and how you can create concise, usable APIs that shorten your development time and ease end-user scripting.

Keywords

object orientation, OO, Python, class, metaclass

Seminar X: Deep Dive into a Time-Series Problem

JIMMY APOLLO: “Well, folks, when you’re right 52% of the time, you’re wrong 48% of the time.”

— “Lisa the Greek” (S03E14)

Materials

Abstract

Why does pandas need so much additional datetime functionality compared to numpy? How do you make sense of data collected across different timezones? How do I know if I need a PeriodIndex or a DatetimeIndex? And importantly, how do I keep track of all of these features?

Do you…

… need to unify various timezones in your data?
… need to work around specific dates, such as holidays and weekends, in your analyses?
… need to produce multiple forecasts subject to a variable number of scenarios and constraints?

Then come join us for a session on time-series analysis in pandas!

Pandas originated in the finance industry. This means that as the library has grown, working with datetimes has remained at the forefront of supported features. But amongst the vast API that pandas offers, it can be difficult to ensure your time-series analyses are running efficiently.

In this episode, we will walk through a complex time-series analysis problem using pandas. We will demonstrate how you can perform scenario analysis in an extensible and flexible manner, and how you can make the best computational use of the pandas datetime interfaces.
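A brief sketch of a few of the datetime interfaces involved, assuming a pandas installation with timezone data available. The timestamps and values are hypothetical and unrelated to the seminar's problem.

```python
import pandas as pd

# Hypothetical observations stamped in New York local time.
stamps = pd.to_datetime(["2024-01-02 09:30", "2024-01-02 16:00"])
ny = pd.Series([100.0, 101.5], index=stamps.tz_localize("America/New_York"))

# Converting to UTC puts data from different timezones on one clock.
utc = ny.tz_convert("UTC")

# Business-day range: skips the weekend of 6-7 January 2024.
business_days = pd.bdate_range("2024-01-05", "2024-01-09")

# A DatetimeIndex labels instants; a PeriodIndex labels spans of time.
months = pd.period_range("2024-01", periods=3, freq="M")

print(utc.index)
print(business_days)
print(months)
```

A rule of thumb, offered here as an assumption rather than as the seminar's guidance: reach for a PeriodIndex when each row describes a whole interval (a month of sales), and a DatetimeIndex when each row describes a moment (a trade).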

Keywords

time-series, pandas, Python, datetime, timezone

Seminar XI: Out With the Old and in With the New: The Evolution of Python’s Standard Library

HOMER SIMPSON: “I’ve gone back to the time when dinosaurs weren’t just confined to zoos.”

— “Treehouse of Horror V” (S06E06)

Materials

Abstract

Ever wonder about the design decisions made about Python’s standard library? Why do old “included-batteries” sit around collecting dust instead of being deprecated?

Do you…

… wonder why 3rd party packages are seldom included into the standard library?
… need to make better use of the Python standard library?
… need to update old scripts to take advantage of modern standard library replacements?

Then come join us for a session on Python’s standard library!

Python is touted as a “batteries-included” programming language. But these batteries have been changed and evolved over time, often leaving users with redundant entry points to perform overlapping tasks. For example, to launch a subprocess should one use os.system, os.popen, os.posix_spawn, subprocess.Popen, subprocess.check_output, subprocess.call, or subprocess.run? How do you make a decision amongst these?
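For the subprocess question specifically, the `subprocess` module documentation recommends `subprocess.run` for all the use cases it can handle. A minimal sketch follows; the child command is a hypothetical example that simply launches another Python interpreter.

```python
import subprocess
import sys

# Launch a child Python process and capture what it prints.
completed = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on non-zero exit
)

output = completed.stdout.strip()
print(output, completed.returncode)
```

Passing the command as a list avoids shell quoting problems, and `check=True` turns a failing child process into an exception instead of a silently ignored return code.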

In this episode we’ll discuss the most widely used and changed modules in the Python standard library, laying out why upgrades or replacements occurred, and the current recommended approaches for these modules. Additionally, we’ll discuss what 3rd party packages can be used as drop-in replacements for parts of the standard library and why some of these 3rd party packages are preferred.

Keywords

standard library, Python

Seminar XII: Seeing pandas in the window: window operations in pandas `.rolling`, `.expanding`, `.ewm`

OTTO MANN: “Oh, wow! Windows! I don’t think I can afford this place…”

— “You Only Move Twice” (S08E02)

Materials

Abstract

What’s the difference between a windowed operation and a grouped operation? How do I effectively leverage the pandas window API? Under what circumstances will a windowed operation be fast or slow?

Do you…

… run scenario analyses against variable amounts of data?
… need to intelligently smooth noisy signals?

Then come join us for a session on pandas window operations!

pandas has a plethora of ways to slice, dice, and analyze your data. The most common data transformations are aggregations. The ability to describe a large dataset with just a few numbers is extremely powerful for understanding your data. But what happens if we over-aggregate, mistakenly throwing out real signal for the convenience of working with aggregated values? Thankfully, window operations are here to help.

In this episode, we’ll look at .rolling, .expanding, and .ewm as well as their various options. These DataFrame methods are just the beginning though, as we’ll also explore the Window, Rolling, Expanding, and ExponentialMovingWindow objects they return. We’ll discuss the uses for these objects, how you can customize them, and how you can ensure your operations are applied efficiently and remain within the Restricted Computation Domain.
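As a small sketch of how the three entry points differ, the example below applies each to the same hypothetical four-value series. The window size and smoothing factor are arbitrary choices for illustration.

```python
import pandas as pd

signal = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: a fixed-size window that slides along the data.
rolling_mean = signal.rolling(window=2).mean()

# Expanding: the window grows to include everything seen so far.
expanding_mean = signal.expanding().mean()

# Exponentially weighted: recent values count more than older ones.
# With adjust=False, each output is alpha * x + (1 - alpha) * previous.
ewm_mean = signal.ewm(alpha=0.5, adjust=False).mean()

print(pd.DataFrame({
    "rolling": rolling_mean,
    "expanding": expanding_mean,
    "ewm": ewm_mean,
}))
```

Unlike a plain aggregation, each of these returns one value per input row, so the shape of the original signal is preserved while the noise is smoothed.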

Keywords

pandas, window, rolling, expanding, Python
