Counting Conditional Occurrence in Pandas Dataframe

Dataframe to Count Conditional Occurrence

In this article, we’ll explore how to count conditional occurrence of data in a pandas dataframe. This involves identifying rows where certain conditions are met and then calculating the frequency of those occurrences.

Problem Statement

Given a sample dataframe with sales data, we want to find out when the sales were greater than 20 (in its previous 5 data) and how many times the inventory was greater than 10.

data = {'Date':     ["2018/12/29","2018/12/26","2018/12/24","2018/12/15","2018/12/11","2018/12/8","2018/11/28","2018/11/20","2018/11/19","2018/11/11","2018/11/6","2018/11/1","2018/10/28","2018/10/11","2018/9/25","2018/9/24"],
'Inventory': [5,5,5,22,5,25,5,15,15,5,5,15,0,22,2,10],
'Sales' : [0,36,18,0,0,17,18,17,34,16,0,0,18,18,51,18]}

df = pd.DataFrame(data)

Solution Using XLRD

The original question uses xlrd to read the sales data from an Excel file. However, this approach has several issues:

Manual Looping: The code uses a manual loop to iterate over each row and calculate the required values.
Inefficient Data Access: This approach involves accessing data in both directions (i.e., forward and backward) for each row.

Here’s how you can modify the original code to improve its efficiency:

import xlrd
from datetime import datetime

old_file = xlrd.open_workbook("C:\\Sales.xlsx")
the_sheet = old_file.sheet_by_name("Sales")

for row_index in range(1, the_sheet.nrows):
    Dates = the_sheet.cell(row_index, 0).value
    Inventory = the_sheet.cell(row_index, 1).value
    Sales = the_sheet.cell(row_index, 2).value

    list_of_Inventory = []

    for i in range(1,5):
        list_of_Inventory.append(the_sheet.cell(row_index - i, 1).value)

    if Sales > 20:
        print str(Dates) + " has Sales " + str(Sales) + " when " + str(sum(i > 10 for i in list_of_Inventory)) + " times."

However, this code still suffers from the same issues as mentioned above. A better approach would be to use pandas data structures and methods.

Solution Using Pandas

Here’s an improved solution using pandas:

import pandas as pd

data = {'Date':     ["2018/12/29","2018/12/26","2018/12/24","2018/12/15","2018/12/11","2018/12/8","2018/11/28","2018/11/20","2018/11/19","2018/11/11","2018/11/6","2018/11/1","2018/10/28","2018/10/11","2018/9/25","2018/9/24"],
'Inventory': [5,5,5,22,5,25,5,15,15,5,5,15,0,22,2,10],
'Sales' : [0,36,18,0,0,17,18,17,34,16,0,0,18,18,51,18]}

df = pd.DataFrame(data)

# Create a new column to track dates
df['Date'] = pd.to_datetime(df['Date'])

# Sort the dataframe by date
df.sort_values('Date', inplace=True)

for idx in df.loc[df.Sales > 20].index:
    inv = df.loc[idx-4:idx, 'Inventory'].ge(10)
    date, _, sales = df.loc[idx]
    if len(inv) >= 5:
        print(f'{date} has Sales {sales} when {inv.sum()} times')

This solution uses the pandas library to efficiently access and manipulate data. The Date column is converted to datetime format using pd.to_datetime(), which allows for efficient sorting and date-based filtering.

The code then iterates over rows with sales greater than 20, calculates the required values, and prints the results.

Explanation

Importing Libraries: We import the pandas library, which provides an efficient way to manipulate and analyze data.
Loading Data: The original question loads the data from an Excel file using xlrd, but we’ve switched to using a sample dataframe for this example.
Data Manipulation: We create a new column Date in the dataframe and convert it to datetime format using pd.to_datetime().
Sorting Dataframe: The dataframe is sorted by date using df.sort_values('Date', inplace=True'). This step ensures that the data is ordered chronologically.
Iterating Over Rows: We iterate over rows with sales greater than 20 using df.loc[df.Sales > 20].index.
Calculating Values: For each row, we calculate the required values: inv, which indicates whether inventory was greater than 10 in the previous 5 data points, and date and sales.
Printing Results: We print the results for each row with sales greater than 20, including the date, sales value, and frequency of inventory exceeding 10.

This approach is more efficient and scalable compared to the original solution using xlrd.

Last modified on 2023-08-10