How to Compare Successive Rows in a Pandas DataFrame: A Custom Matrix Solution

Inequality between successive rows in pandas DataFrame

Introduction

When working with DataFrames in pandas, it’s often necessary to compare the values of successive rows, for example to find runs of identical rows. In this article, we’ll explore how to build a matrix in which each row of the DataFrame is assigned, with a single 1, to the run of identical successive rows it belongs to.

The Problem

The problem lies in the fact that pandas’ ne method, which tests for inequality, works element by element: df.ne(df.shift()) returns a boolean DataFrame with the same shape as df, one value per cell. To decide whether a row differs from its previous row, the per-column results still have to be combined into a single value per row, and that combination has to take every column into account.

To illustrate this, let’s consider an example:

import pandas as pd

d = {'col1': ['French', 'French', 'Japanese', 'Chinese', 'Chinese', 'English'],
     'col2': ['France', 'France', 'Japan', 'China', 'China', 'Canada'],
     'col3': [0.30, 0.30, 0.25, 0.21, 0.21, 0.37]}
df = pd.DataFrame(data=d)

print(df)

Output:

       col1    col2  col3
0    French  France  0.30
1    French  France  0.30
2  Japanese   Japan  0.25
3   Chinese   China  0.21
4   Chinese   China  0.21
5   English  Canada  0.37

If we want to build a matrix that records which successive rows are identical, the comparison has to take every column into account.
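As a quick check of what the comparison returns, here is a minimal sketch using the same sample data. The name mask is illustrative only. It shows that ne yields one boolean per cell, so a further reduction across columns is needed before rows can be compared as a whole.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['French', 'French', 'Japanese', 'Chinese', 'Chinese', 'English'],
    'col2': ['France', 'France', 'Japan', 'China', 'China', 'Canada'],
    'col3': [0.30, 0.30, 0.25, 0.21, 0.21, 0.37],
})

# Element-wise comparison with the previous row: same shape as df
mask = df.ne(df.shift())
print(mask.shape)            # (6, 3)

# Row 1 repeats row 0, so nothing changed in any column
print(mask.iloc[1].tolist())  # [False, False, False]

# Row 2 differs from row 1 in every column
print(mask.iloc[2].tolist())  # [True, True, True]
```

The first row is always flagged as different, because shift() fills it with NaN and NaN never compares equal to anything.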

The Solution

To solve this problem, we can use the following steps:

Compare each row to its previous row using the ne method.
Reduce the per-column results to a single flag per row that marks where a new group of identical rows starts.
Calculate the cumulative sum of that flag to obtain a group number for each row.
Use the group number as a column index and set the matching entry of a zero matrix to 1.
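The cumulative-sum step is the least obvious one, so here it is in isolation. This is a small sketch with hypothetical flags (starts_new_group is a made-up name), not part of the solution itself.

```python
import pandas as pd

# Hypothetical per-row flags: True marks a row that starts a new group
starts_new_group = pd.Series([True, False, True, True, False, True])

# True counts as 1 and False as 0, so the running total
# only increases on rows that start a new group
group = starts_new_group.cumsum()
print(group.tolist())  # [1, 1, 2, 3, 3, 4]
```

Because the group numbers start at 1, subtracting 1 turns them into zero-based column positions for a NumPy array.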

Here’s how you can do it in Python:

import pandas as pd
import numpy as np

d = {'col1': ['French', 'French', 'Japanese', 'Chinese', 'Chinese', 'English'],
     'col2': ['France', 'France', 'Japan', 'China', 'China', 'Canada'],
     'col3': [0.30, 0.30, 0.25, 0.21, 0.21, 0.37]}
df = pd.DataFrame(data=d)

# Compare each row to its previous row, cell by cell
group = df.ne(df.shift())

# Flag a row only if every column changed, then take the cumulative sum
group = group.all(axis=1).cumsum()

# Use the group number (1-based) as a column index into a zero matrix
Jac = np.zeros((len(df), group.max()), dtype=int)
Jac[np.arange(len(df)), group.to_numpy() - 1] = 1

print(Jac)

This will output:

[[1 0 0 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 0 1]]

On this sample the result looks right, but only because successive rows here either match in every column or differ in every column. The reduction itself is flawed: if only one column of two successive rows is identical, all(axis=1) returns False for that row, so the row is placed in the same group as the previous one instead of starting a new group. We need to consider all columns and start a new group as soon as any of them differs.
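To see why every column has to take part in the comparison, consider a small hypothetical frame (demo is not part of the original data) in which two successive rows share a value in one column only. The sketch compares three ways of reducing the element-wise mask to group numbers.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share col1 but differ in col2
demo = pd.DataFrame({
    'col1': ['French', 'French', 'Japanese'],
    'col2': ['France', 'Canada', 'Japan'],
})

mask = demo.ne(demo.shift())

# Looking at col1 alone merges rows 0 and 1
print(mask['col1'].cumsum().tolist())      # [1, 1, 2]

# Requiring every column to change merges them as well
print(mask.all(axis=1).cumsum().tolist())  # [1, 1, 2]

# Requiring a change in any column keeps them apart
print(mask.any(axis=1).cumsum().tolist())  # [1, 2, 3]
```

Only the last reduction treats rows 0 and 1 as different rows, which is what a whole-row comparison should do.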

The Correct Solution

To fix this issue, a row must start a new group whenever at least one of its columns differs from the previous row. Reducing the comparison with any(axis=1) does exactly that. Here’s how you can do it:

import pandas as pd
import numpy as np

d = {'col1': ['French', 'French', 'Japanese', 'Chinese', 'Chinese', 'English'],
     'col2': ['France', 'France', 'Japan', 'China', 'China', 'Canada'],
     'col3': [0.30, 0.30, 0.25, 0.21, 0.21, 0.37]}
df = pd.DataFrame(data=d)

# Compare each row to its previous row, cell by cell
group = df.ne(df.shift())

# Flag rows where any column changed, then take the cumulative sum to number the groups
group = group.any(axis=1).cumsum()

# Create a matrix with one row per DataFrame row and one column per group
Jac = np.zeros((len(df), group.max()), dtype=int)
Jac[np.arange(len(df)), group.to_numpy() - 1] = 1

print(Jac)

This will output:

[[1 0 0 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 0 1]]

This is the expected output: rows 0 and 1 share the first column, row 2 has the second to itself, rows 3 and 4 share the third, and row 5 has the fourth.
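For reuse, the steps can be wrapped in a small helper. This is a sketch under stated assumptions: the function name run_indicator_matrix is hypothetical, and a row is taken to start a new group whenever any column differs from the previous row.

```python
import numpy as np
import pandas as pd


def run_indicator_matrix(frame: pd.DataFrame) -> np.ndarray:
    """One-hot matrix: entry (i, j) is 1 if row i belongs to the j-th group
    of identical successive rows."""
    if frame.empty:
        # Nothing to group: no columns in the result
        return np.zeros((len(frame), 0), dtype=int)
    # A row starts a new group when any column differs from the previous row
    starts_new_group = frame.ne(frame.shift()).any(axis=1)
    # Zero-based group index for every row
    group_index = starts_new_group.cumsum().to_numpy() - 1
    matrix = np.zeros((len(frame), group_index.max() + 1), dtype=int)
    matrix[np.arange(len(frame)), group_index] = 1
    return matrix


df = pd.DataFrame({
    'col1': ['French', 'French', 'Japanese', 'Chinese', 'Chinese', 'English'],
    'col2': ['France', 'France', 'Japan', 'China', 'China', 'Canada'],
    'col3': [0.30, 0.30, 0.25, 0.21, 0.21, 0.37],
})

Jac = run_indicator_matrix(df)
print(Jac.shape)        # (6, 4)
print(Jac.sum(axis=0))  # rows per group: [2 1 2 1]
```

One caveat: NaN never compares equal to NaN, so with this approach a row containing a missing value always starts a new group, even if the previous row looks the same.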

Conclusion

In this article, we explored how to build a matrix that records which successive rows of a pandas DataFrame are identical. A first attempt did not take every column into account properly. The correct solution flags a row as the start of a new group whenever any of its columns differs from the previous row, then uses the running count of those flags as the column index for each row.

Last modified on 2025-03-13