.rolling, .expanding, .ewm
“The following tale of alien encounters is true. And by true, I mean false. It’s all lies. But they’re entertaining lies. And, in the end, isn’t that the real truth? The answer is: no.”
— ‘The Springfield Files’ (S08E10)
How do we handle inconsistent, missing, or poorly formatted data in our analyses? How do we ensure that our analyses are accurate if our data may have errors in it?
Do you…
Then join us for a session on efficient data cleaning in Python!
In this episode we’ll take a deep dive into a single dataset to better understand how we can effectively clean tabular data in pandas. We’ll discuss what clean data is and why untidy data can have disastrous effects, while also exploring how we can best leverage pandas to tidy our data.
We’ll cover topics like wrangling strings, renaming, parsing datatypes, sorting/reindexing, reshaping, and binning to ensure your data is ready to be used by your models. Additionally, we’ll provide strategies you can use to verify the cleanliness of your data.
Keywords:
“My eyes! The goggles do nothing!”
— ‘Radioactive Man’ (S07E02)
How do you produce interactive visualizations using Plotly or Altair? How can these visualizations drive further insights or communicate powerful messages?
Do you…
Then join us for a session on plotting without matplotlib!
Matplotlib is a fantastic tool for publication-quality visualizations and for communicating complex conclusions about static data. But what if we need an interactive visualization—one which we can explore in various ways and which we can dynamically update?
In this episode, we’ll take a look at JavaScript-based plotting tools such as Plotly and Altair for interactive visualizations and compare them to what we’re able to do in Matplotlib. We’ll see how these tools can give further insights into our data and allow us to create complex, interactive analyses that can communicate sophisticated conclusions. We’ll seek answers to questions like: where do these new tools improve upon Matplotlib’s API? We’ll discuss how these various tools differ from each other, and we’ll show examples of using Altair and Plotly for effective data exploration and communication.
Keywords: plotting, visualization, matplotlib, plotly, altair, graph, charts
“I used to be with ‘it,’ but then changed what ‘it’ was. Now what I’m with isn’t ‘it’ anymore, and what’s ‘it’ seems weird and scary.”
— ‘Homerpalooza’ (S07E24)
How do you persist tabular data; how do you store it to disk? Of all the common formats—.csv, .npy, Parquet, Feather, HDF, etc.—how do you know which to choose and what trade-offs they each present?
Do you…
Then come join us for a session on columnar file and database formats!
How do you store tabular data between analyses? How do you store important
metadata along with the tabular data? What do you do about parts of your data
that may require runtime support (e.g., complex indices or the use of
user-defined types)? Why and when is pickle not enough, or even a bad idea?
In this episode, we’ll discuss persisting tabular data for storage to disk or
transmission to other users (e.g., to and from API endpoints or to and from
nodes in a distributed computation). We’ll discuss the difference between
text-based formats (e.g., csv, txt, fwf, tst, dat, json), binary
formats (e.g., pickle, npy, parquet, feather), and database formats
(e.g., hdf, sql). We’ll cover which of these is most convenient and
performant for persisting data, as well as the contexts in which these file
formats should be used or avoided.
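One trade-off between text-based and binary formats can be sketched with only the standard library. In this minimal sketch, csv and pickle stand in for the wider families of formats named above, and the rows are hypothetical illustration data. A csv round trip turns every value into a string, while a pickle round trip preserves the original Python types.

```python
import csv
import io
import pickle

# Hypothetical rows used only for illustration.
rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]

# Text-based round trip: csv stores every value as text.
text_buffer = io.StringIO(newline="")
writer = csv.DictWriter(text_buffer, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
text_buffer.seek(0)
from_csv = list(csv.DictReader(text_buffer))

# Binary round trip: pickle preserves the original Python types.
from_pickle = pickle.loads(pickle.dumps(rows))

print(from_csv[0])     # values come back as strings
print(from_pickle[0])  # values come back as int and float
```

Note that loading a pickle can execute arbitrary code, so it is only appropriate for data from a trusted source.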
Keywords: parquet, csv, persistence, feather, tabular, data
“The doctor said I wouldn’t have so many nosebleeds if I kept my finger out of there.”
— ‘I Love Lisa’ (S04E15)
How do you find bugs deep in your code? How can you use common tools in Python to ease this process? And, once you find a bug, what can you do to surgically investigate and fix it?
Do you…
print when debugging?
Then join us for a session on debugging Python code!
Dynamic languages like Python allow you to rapidly prototype code without a compiler getting in your way. However, this means that bugs can sometimes slip into our code if we’re not careful.
In this episode, we’ll discuss the various ways you can debug your Python
programs. We’ll discuss techniques for instrumenting live code to track
down bugs in large programs, as well as how to best use tools such as pdb
and ipdb, along with common mechanisms within the sys module.
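As a rough illustration of instrumenting live code with the sys module, the sketch below uses sys.settrace to record which Python functions get called. Choosing sys.settrace here is an assumption about which sys mechanisms are meant, and the functions compute and helper are hypothetical.

```python
import sys

def make_call_tracer(log):
    """Build a trace function that records the name of each called function."""
    def tracer(frame, event, arg):
        if event == "call":
            log.append(frame.f_code.co_name)
        return None  # do not trace line-by-line inside the call
    return tracer

# Hypothetical code under investigation.
def helper(x):
    return x * 2

def compute(x):
    return helper(x) + 1

calls = []
previous = sys.gettrace()          # remember any tracer already installed
sys.settrace(make_call_tracer(calls))
try:
    result = compute(3)
finally:
    sys.settrace(previous)         # always restore the previous tracer

print(calls)
print(result)
```

For interactive investigation, the built-in breakpoint() drops into pdb by default at the line where it is placed.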
Keywords: debugging, pdb, Python debugger, ipdb, breakpoint
“Kids, you tried your best, and you failed miserably. The lesson is: never try.”
— Burns’ Heir (S05E18)
How do you make use of concurrency mechanisms in Python? How do you decide which of the various approaches to take, and how can these various approaches help you improve your code?
Do you…
Then join us for a session on concurrency and parallelism in Python.
The Global Interpreter Lock (“GIL”) is a specter that haunts concurrency approaches in Python. It eliminates threading as a simple concurrency primitive and forces us to develop a much more sophisticated view of concurrency and parallelism.
In this episode, we will discuss concurrency approaches in Python, touching
upon standard library modules such as threading, multiprocessing,
multiprocessing.shared_memory, and concurrent.futures. We’ll discuss the
motivation behind using these tools, their limitations, their relationship to
the Global Interpreter Lock (“GIL”), existing workarounds to avoid the GIL, and
other related approaches for getting more work done in less overall time.
(Note: this session will not cover asyncio.)
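A minimal sketch of one of these standard library tools, concurrent.futures, is shown below. The fetch function is a hypothetical stand-in for a blocking I/O call. Because time.sleep releases the GIL while waiting, the four calls can overlap in threads rather than run back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(delay):
    """Hypothetical stand-in for a blocking I/O call such as a network request."""
    time.sleep(delay)
    return delay

delays = [0.1, 0.1, 0.1, 0.1]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order in its results
    results = list(pool.map(fetch, delays))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s (serial would be about {sum(delays):.2f}s)")
```

For CPU-bound work in pure Python, threads generally do not give the same benefit, which is where process-based approaches such as multiprocessing come in.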
Keywords: threading, multiprocessing, concurrency, parallelism, GIL, Global Interpreter Lock
“But I predict that within 100 years, computers will be twice as powerful, 10000× larger, and so expensive that only the five richest kings of Europe will own them.”
— ‘Much Apu About Nothing’ (S07E23)
How do you profile memory usage in your Python code? How do you find the spots in your code where you are allocating too much memory? And, critically, what do you *do about it?*
Do you…
Then come join us for a session on memory profiling in Python (and pandas)!
Python does not provide fine-grained control over memory allocation,
deallocation, or layout. Tools like pandas often encourage making copies, going
so far as to remove keyword arguments like inplace=… from their APIs. Though
it may be the case for many of our analyses that we can simply buy a bigger
computer, there are still circumstances under which we may need to control,
predict, and reduce our memory usage.
In this episode, we will cover memory profilers in Python (memray) and how to
effectively use them to locate, investigate, and address memory-intensive code.
We’ll develop intuition around memory use, discuss how Python allocates and
deallocates memory, and discuss approaches we can take in code using pandas
to address memory issues and to improve the memory efficiency of our analyses.
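To give a feel for measuring allocations, here is a small sketch that uses the standard library’s tracemalloc module. This is a swapped-in stand-in chosen so the example needs no third-party install; it is not memray, and the list being allocated is purely illustrative.

```python
import tracemalloc

tracemalloc.start()

data = list(range(200_000))        # allocate a large list of integers
current_with_data, _ = tracemalloc.get_traced_memory()

del data                           # drop the only reference to the list
current_after_del, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

print(f"while held:   {current_with_data:,} bytes")
print(f"after delete: {current_after_del:,} bytes")
print(f"peak:         {peak:,} bytes")
```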
Keywords: profiling, memory, pandas, Python
KRUSTY THE CLOWN: “Let’s just say it moved me… TO A BIGGER HOUSE! Oops! I said the quiet part loud and the loud part quiet.”
— “A Star is Burns” (S06E18)
How do you create an interactive dashboard in Python? How do you deploy and share that dashboard? And, critically, can we do this without being a front-end developer?
Do you…
Then come join us for a session on dashboarding in Python!
Python is a premier programming language for scientific computation, whether
it’s used for data exploration, visualization, or reporting. While Python has
always supported interactive visualizations via matplotlib, its web-based
dashboarding capabilities have been limited to rendering static plots onto the
web.
In this episode, we will provide an overview of some Python-based dashboarding
tools like panel, bokeh, and holoviews. These tools provide a convenient
mechanism to create complex dashboards enabling you to better understand your
data, and even share that understanding with your colleagues. We’ll develop an
understanding about the uses of a data dashboard, the core features of these
Python tools, and how these frameworks are implemented to effectively work with
them without needing to be a front-end developer.
Keywords:
LYLE LANLEY: “So, then, ‘mono’ means ‘one,’ and ‘rail’ means ‘rail.’ And that concludes our intensive three week course.”
— “Marge vs the Monorail” (S04E12)
Do generators have any part in API design? Are generators just some advanced Python novelty feature? And, importantly, how do I actually use generators in my day-to-day work?
Do you…
Then come join us for a session on generators!
Generators are considered an advanced Python feature and are often viewed only as lazily evaluated iterators. While this is mechanically true, they have a much broader application in program structure and can be used to simplify step-wise processing pipelines.
In this episode, we will move beyond the superficial mechanics of generators in Python and explore their implementation and use-cases. We will develop a better sense for how generators can be leveraged in your everyday scripting as a structuring mechanism to reduce your code churn and produce less error-prone results.
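As a small sketch of generators used as a structuring mechanism, the step-wise pipeline below chains three generator functions, each responsible for a single stage. The function names and input lines are hypothetical.

```python
def read_lines(lines):
    """Stage 1: strip trailing newlines."""
    for line in lines:
        yield line.rstrip("\n")

def skip_blank(lines):
    """Stage 2: drop empty or whitespace-only lines."""
    for line in lines:
        if line.strip():
            yield line

def parse_ints(lines):
    """Stage 3: convert each remaining line to an integer."""
    for line in lines:
        yield int(line)

# Hypothetical raw input, e.g. lines read from a file.
raw = ["1\n", "\n", "2\n", "  \n", "3\n"]

# Nothing runs until the pipeline is consumed; each item flows through
# all three stages before the next item is read.
pipeline = parse_ints(skip_blank(read_lines(raw)))
total = sum(pipeline)
print(total)
```

Each stage can be tested, replaced, or reordered on its own, which is the structuring benefit beyond lazy evaluation.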
Keywords
BART SIMPSON: Todd, you and Data are Team Strike Force. Nelson, that just leaves you and Martin. MARTIN PRINCE, JR: Team Discovery Channel!
— “Lemon of Troy” (S06E24)
Do I need object orientation for encapsulation in Python? Can useful APIs consist entirely of functions? Should metaclasses ever be used? And, most importantly, when do I actually need to write a class?
Do you…
Then come join us for a session on object orientation!
Object orientation is useful because of its unique features like inheritance and composition, right? As it turns out, that’s only part of the story. The usefulness of object orientation in Python does not lie in a mechanical understanding of its features, but in its application to decomposing real-world problems.
In this episode, we will cover when and why object orientation should be used in Python code and how it can be used appropriately to simplify your life. Additionally, we will discuss common pitfalls of implementing objects and how you can create concise, usable APIs that shorten your development time and ease end-user scripting.
Keywords
JIMMY APOLLO: “Well, folks, when you’re right 52% of the time, you’re wrong 48% of the time.”
— “Lisa the Greek” (S03E14)
Why does pandas need so much additional datetime functionality compared to
numpy? How do you make sense of data collected across different timezones?
How do I know if I need a PeriodIndex or a DatetimeIndex? And importantly,
how do I keep track of all of these features?
Do you…
Then come join us for a session on time-series analysis in pandas!
Pandas originated in the finance industry. This means that as the library has grown, working with datetimes has remained at the forefront of supported features. But amongst the vast API that pandas offers, it can be difficult to ensure your time-series analyses are running efficiently.
In this episode, we will walk through a complex time-series analysis problem using pandas. We will demonstrate how you can perform scenario analysis in an extensible and flexible manner, and how you can make the best computational use of pandas’ datetime interfaces.
Keywords
HOMER SIMPSON: “I’ve gone back to the time when dinosaurs weren’t just confined to zoos.”
— “Treehouse of Horror V” (S06E06)
Ever wonder about the design decisions made about Python’s standard library? Why do old “included batteries” sit around collecting dust instead of being deprecated?
Do you…
Then come join us for a session on Python’s standard library!
Python is touted as a “batteries-included” programming language. But these
batteries have changed and evolved over time, often leaving users with
redundant entry points to perform overlapping tasks. For example, to launch a
subprocess should one use os.system, os.popen, os.posix_spawn,
subprocess.Popen, subprocess.check_output, subprocess.call, or
subprocess.run? How do you make a decision amongst these?
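For orientation, here is a minimal sketch of one of those entry points, subprocess.run, which the Python documentation recommends for the use cases it can handle. The child command is a hypothetical example that simply asks the current interpreter to print a message.

```python
import subprocess
import sys

completed = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,  # collect stdout and stderr instead of inheriting them
    text=True,            # decode the captured bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
output = completed.stdout.strip()
print(output)
```

Passing the command as a list avoids going through a shell, which sidesteps quoting problems in the arguments.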
In this episode we’ll discuss the most widely used and changed modules in the Python standard library, laying out why upgrades or replacements occurred, and the current recommended approaches for these modules. Additionally, we’ll discuss what third-party packages can be used as drop-in replacements for parts of the standard library and why some of these third-party packages are preferred.
Keywords
OTTO MANN: “Oh, wow! Windows! I don’t think I can afford this place…”
— “You Only Move Twice” (S08E02)
What’s the difference between a windowed operation and a grouped operation? How do I effectively leverage the pandas window API? Under what circumstances will a windowed operation be fast or slow?
Do you…
Then come join us for a session on pandas window operations!
pandas has a plethora of ways to slice, dice, and analyze your data. The most common data transformations are aggregations. The ability to describe a large dataset with just a few numbers is extremely powerful for understanding your data. But what happens if we over-aggregate, mistakenly throwing out real signal for the convenience of working with aggregated values? Thankfully, window operations are here to help.
In this episode, we’ll look at .rolling, .expanding, and .ewm as well as their various options. These DataFrame methods are just the beginning though, as we’ll also explore the Window, Rolling, Expanding, and ExponentialMovingWindow objects they return. We’ll discuss the uses for these objects, how you can customize them, and how you can ensure your operations are applied efficiently and remain within the Restricted Computation Domain.
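As a quick sketch of the three methods on a toy Series (the values, window size, and alpha are arbitrary choices for illustration), each call returns a window object that is then aggregated with .mean():

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Fixed-size window: mean of the current and two previous values.
rolling_mean = s.rolling(window=3).mean()

# Growing window: mean of everything seen so far.
expanding_mean = s.expanding().mean()

# Exponentially weighted: recent values count more than older ones.
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

summary = pd.DataFrame(
    {"rolling": rolling_mean, "expanding": expanding_mean, "ewm": ewm_mean}
)
print(summary)
```

The first two rolling values are NaN because a full window of three observations is not yet available; passing min_periods changes that behavior.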
Keywords