The solution I settled with in the end was:
df.group_by(
"customer", "product_type"
).agg(
pl.col.date.diff().dt.total_days().alias("DiffDays"),
).with_columns(
pl.col.DiffDays.list.mean().round(1).alias("Avg_Days_Between_Deals")
).drop("DiffDays")
I am currently doing quite a bit of revenue data analysis at my job. The other day my manager came up to me and told me that he wanted to add additional KPIs to the analysis. The given dataset looks like that:
date [str] | product_type [str] | customer [str] | revenue [float64] |
---|---|---|---|
2024-05-10 | Product A | Example Corp | 3400.0 |
2024-05-10 | Product A | Beispiel GmbH | 350.9 |
… | … | … | … |
To get a first idea, let’s vibe code this example and see what Claude Sonnet 4 came up with:
def calculate_avg_duration_between_deals(df):
"""
Calculate average duration between consecutive deals for each customer-product combination
"""
return (
df.sort(["customer", "product_type", "date"])
.group_by(["customer", "product_type"])
.agg(
[
pl.col("date").alias("dates"),
pl.col("revenue").count().alias("total_deals"),
]
)
.with_columns(
[
# Calculate differences between consecutive dates
pl.col("dates")
.map_elements(
lambda dates: [
(dates[i] - dates[i - 1]).days
for i in range(1, len(dates))
]
if len(dates) > 1
else [],
return_dtype=pl.List(pl.Int64),
)
.alias("days_between_deals")
]
)
.with_columns(
[
# Calculate average duration between deals
pl.col("days_between_deals")
.map_elements(
lambda days_list: sum(days_list) / len(days_list)
if len(days_list) > 0
else None,
return_dtype=pl.Float64,
)
.alias("avg_days_between_deals")
]
)
.select(
[
"customer",
"product_type",
"total_deals",
"avg_days_between_deals",
]
)
.filter(
pl.col("total_deals") > 1
) # Only customers with multiple deals
.sort(["customer", "product_type"])
)
calculate_avg_duration_between_deals(df)
I deemed the AI’s solution as far too complicated for this basic problem. Therefore, I decided to use my own brain (and the polars user guide) to come up with something more streamlined.
Luckily the data can be easily formatted as dates using the str.to_datetime()
method.
df.with_columns(
pl.col.Date.str.to_datetime("%Y-%m-%d")
)
In contrast to the complicated version the AI gave us, we can simply use the .diff
method, which returns us a column with durations stored in a pl.List
.
Since we are dealing with durations, the datetime dt
namespace of polars is useful. It provides convenient methods for transforming the duration into minutes or, as in our case total_days()
.
Retrieving the average of the list is simple by calling the .mean()
method on the list.
I rounded the result afterward and gave the column a more fitting name, but that is it.
df.group_by(
"customer", "product_type"
).agg(
pl.col.Datum.diff().dt.total_days().alias("DiffDays"),
).with_columns(
pl.col.DiffDays.list.mean().round(1).alias("Avg_Days_Between_Deals")
)