TWIL - Calculate days between sales using polars

The solution I settled with in the end was:

df.group_by(
	"customer", "product_type"
).agg(
	pl.col.date.diff().dt.total_days().alias("DiffDays"),
).with_columns(
	pl.col.DiffDays.list.mean().round(1).alias("Avg_Days_Between_Deals")
).drop("DiffDays")

I am currently doing quite a bit of revenue data analysis at my job. The other day my manager came up to me and told me that he wanted to add additional KPIs to the analysis. The given dataset looks like that:

date [str]	product_type [str]	customer [str]	revenue [float64]
2024-05-10	Product A	Example Corp	3400.0
2024-05-10	Product A	Beispiel GmbH	350.9
…	…	…	…

To get a first idea, let’s vibe code this example and see what Claude Sonnet 4 came up with:

def calculate_avg_duration_between_deals(df):
    """
    Calculate average duration between consecutive deals for each customer-product combination
    """
    return (
        df.sort(["customer", "product_type", "date"])
        .group_by(["customer", "product_type"])
        .agg(
            [
                pl.col("date").alias("dates"),
                pl.col("revenue").count().alias("total_deals"),
            ]
        )
        .with_columns(
            [
                # Calculate differences between consecutive dates
                pl.col("dates")
                .map_elements(
                    lambda dates: [
                        (dates[i] - dates[i - 1]).days
                        for i in range(1, len(dates))
                    ]
                    if len(dates) > 1
                    else [],
                    return_dtype=pl.List(pl.Int64),
                )
                .alias("days_between_deals")
            ]
        )
        .with_columns(
            [
                # Calculate average duration between deals
                pl.col("days_between_deals")
                .map_elements(
                    lambda days_list: sum(days_list) / len(days_list)
                    if len(days_list) > 0
                    else None,
                    return_dtype=pl.Float64,
                )
                .alias("avg_days_between_deals")
            ]
        )
        .select(
            [
                "customer",
                "product_type",
                "total_deals",
                "avg_days_between_deals",
            ]
        )
        .filter(
            pl.col("total_deals") > 1
        )  # Only customers with multiple deals
        .sort(["customer", "product_type"])
    )


calculate_avg_duration_between_deals(df)

I deemed the AI’s solution as far too complicated for this basic problem. Therefore, I decided to use my own brain (and the polars user guide) to come up with something more streamlined.

Luckily the data can be easily formatted as dates using the str.to_datetime() method.

df.with_columns(
	pl.col.Date.str.to_datetime("%Y-%m-%d")
)

In contrast to the complicated version the AI gave us, we can simply use the .diff method, which returns us a column with durations stored in a pl.List. Since we are dealing with durations, the datetime dt namespace of polars is useful. It provides convenient methods for transforming the duration into minutes or, as in our case total_days(). Retrieving the average of the list is simple by calling the .mean() method on the list. I rounded the result afterward and gave the column a more fitting name, but that is it.

df.group_by(
	"customer", "product_type"
).agg(
	pl.col.Datum.diff().dt.total_days().alias("DiffDays"),
).with_columns(
	pl.col.DiffDays.list.mean().round(1).alias("Avg_Days_Between_Deals")
)