Converting a Dictionary List to a Pandas DataFrame

When working with data in Python, it’s common to encounter dictionary lists that need to be converted into structured dataframes for easier manipulation and analysis. In this article, we’ll explore how to convert a dictionary list into a pandas DataFrame using the pd.json_normalize function.

Understanding Dictionary Lists

A dictionary list (a list of dictionaries) is a collection in which each dictionary represents one record. The keys of each dictionary become column names, and the values become the cell values for that row. A value can itself be a nested list or dictionary, which is where flattening comes in. For example, given a dictionary list like this:

orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6,'name': 'headphone'} , {'product_id': 7,'name': 'airbuds'} ]
    },

    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8,'name': 'smartwatch'} , {'product_id': 9,'name': 'smartphone'} ]
    },
]

We can see that each dictionary has a unique order_id and a status, while its line_items key holds another list of dictionaries, each containing a product_id and a name.
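To see why this nesting needs special handling, it may help to look at what the plain DataFrame constructor does with the same data. The following is a minimal sketch (the variable name df_raw is only illustrative): the top-level keys become columns, but line_items stays as a Python list inside each cell.

```python
import pandas as pd

orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6, 'name': 'headphone'}, {'product_id': 7, 'name': 'airbuds'}]
    },
    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8, 'name': 'smartwatch'}, {'product_id': 9, 'name': 'smartphone'}]
    },
]

# The constructor does not unpack nested lists: one row per order
df_raw = pd.DataFrame(orderlist)
print(df_raw.columns.tolist())

# Each line_items cell still holds the original list of dictionaries
print(type(df_raw.loc[0, 'line_items']))
```

Getting one row per line item is the step that pd.json_normalize handles.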

Working with Pandas DataFrames

Pandas is a powerful library for data manipulation and analysis in Python. A pandas DataFrame is a two-dimensional labeled data structure that provides efficient storage and retrieval of data.

In this section, we’ll explore how to convert the dictionary list into a pandas DataFrame using the pd.json_normalize function.

Using pd.json_normalize

The pd.json_normalize function takes a dictionary or a list of dictionaries as input and converts it from a nested structure into a flat table. Nested dictionaries become extra columns, and a nested list of records can be expanded into one row per record.

Here’s how you can use pd.json_normalize to convert your dictionary list into a pandas DataFrame:

import pandas as pd

orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6,'name': 'headphone'} , {'product_id': 7,'name': 'airbuds'} ]
    },

    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8,'name': 'smartwatch'} , {'product_id': 9,'name': 'smartphone'} ]
    },
]

df = pd.json_normalize(orderlist, ['line_items'], ['order_id', 'status'])
print(df)

Output:

   product_id        name order_id     status
0           6   headphone        5  completed
1           7     airbuds        5  completed
2           8  smartwatch        6    pending
3           9  smartphone        6    pending

In this code, pd.json_normalize takes three arguments:

The first argument (data) is the dictionary list that we want to normalize.
The second argument (record_path) is the key, or path of keys, that leads to the nested list of records inside each dictionary. Every element of that list becomes one row in the result. In our case, it’s 'line_items'.
The third argument (meta) is a list of fields from the parent dictionary that are repeated on each of those rows. Here, we’re using ['order_id', 'status'].

By providing these parameters, pd.json_normalize can convert our dictionary list into a flat table with the desired columns.
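The same call can be written with keyword arguments, which makes the role of each argument easier to read. The sketch below also uses the optional record_prefix parameter to mark which columns came from the nested records; the 'item_' prefix and the trimmed-down sample data are illustrative choices, not part of the original example.

```python
import pandas as pd

orderlist = [
    {'order_id': 5, 'status': 'completed',
     'line_items': [{'product_id': 6, 'name': 'headphone'}, {'product_id': 7, 'name': 'airbuds'}]},
    {'order_id': 6, 'status': 'pending',
     'line_items': [{'product_id': 8, 'name': 'smartwatch'}, {'product_id': 9, 'name': 'smartphone'}]},
]

df = pd.json_normalize(
    orderlist,
    record_path='line_items',       # nested list to expand into rows
    meta=['order_id', 'status'],    # parent fields repeated on every row
    record_prefix='item_',          # illustrative prefix for the record columns
)
print(df.columns.tolist())
```

A prefix like this can help when a record field and a parent field share the same name.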

Handling Nested Data Structures

Nested data comes in two shapes, and pd.json_normalize treats them differently. A nested dictionary (such as an address) describes the same record in more detail, so flattening it adds columns. A nested list of records (such as line items) holds several child records, so flattening it adds rows. Which one you have determines which arguments you need.

Here are a few ways you can handle nested data structures using pd.json_normalize:

Flattening an Entire Branch

If you want to flatten an entire branch of the tree (i.e., all the values within a particular key) and that branch is a nested dictionary, you don’t need record_path at all. Calling pd.json_normalize on the data alone expands each nested dictionary into columns named parent.child. Passing a dictionary-valued key such as 'address' as record_path raises an error instead, because record_path must point to a list of records.

import pandas as pd

data = [
    {
        'id': 1,
        'name': 'John',
        'address': {
            'street': '123 Main St',
            'city': 'Anytown',
            'state': 'CA'
        }
    },
    {
        'id': 2,
        'name': 'Jane',
        'address': {
            'street': '456 Elm St',
            'city': 'Othertown',
            'state': 'NY'
        }
    }
]

# Nested dictionaries are flattened automatically into 'address.<key>' columns
df = pd.json_normalize(data)
print(df)

This code would output:

   id  name address.street address.city address.state
0   1  John    123 Main St      Anytown            CA
1   2  Jane     456 Elm St    Othertown            NY
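Two optional parameters control how nested dictionaries are flattened: sep changes the separator used in the generated column names, and max_level limits how many levels of nesting are expanded. The following is a small sketch on a shortened version of the address data; the variable names are illustrative.

```python
import pandas as pd

data = [
    {'id': 1, 'name': 'John', 'address': {'street': '123 Main St', 'city': 'Anytown'}},
    {'id': 2, 'name': 'Jane', 'address': {'street': '456 Elm St', 'city': 'Othertown'}},
]

# Use an underscore instead of the default '.' in flattened column names
df_sep = pd.json_normalize(data, sep='_')
print(df_sep.columns.tolist())

# max_level=0 stops before the first nested level, so 'address' stays a dict
df_shallow = pd.json_normalize(data, max_level=0)
print(df_shallow.columns.tolist())
```

Underscore-separated names can be convenient because they remain usable with attribute-style access such as df_sep.address_city.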

Normalizing Multiple Branches

pd.json_normalize expands one record_path per call, so passing ['address', 'order_items'] does not normalize two branches. A list given as record_path is read as a single nested path (the 'order_items' key inside 'address'), which does not exist in this data. To combine a nested list with a nested dictionary, point record_path at the list and add the dictionary’s fields to meta, where each nested field is written as a list of keys.

import pandas as pd

data = [
    {
        'id': 1,
        'name': 'John',
        'address': {
            'street': '123 Main St',
            'city': 'Anytown'
        },
        'order_items': [
            {'product_id': 101, 'quantity': 2},
            {'product_id': 102, 'quantity': 1}
        ]
    },
    {
        'id': 2,
        'name': 'Jane',
        'address': {
            'street': '456 Elm St',
            'city': 'Othertown'
        },
        'order_items': [
            {'product_id': 201, 'quantity': 3},
            {'product_id': 202, 'quantity': 2}
        ]
    }
]

# One row per order item; parent and address fields are repeated via meta
df = pd.json_normalize(
    data,
    record_path='order_items',
    meta=['id', 'name', ['address', 'street'], ['address', 'city']]
)
print(df)

This code would output:

   product_id  quantity id  name address.street address.city
0         101         2  1  John    123 Main St      Anytown
1         102         1  1  John    123 Main St      Anytown
2         201         3  2  Jane     456 Elm St    Othertown
3         202         2  2  Jane     456 Elm St    Othertown

By pointing record_path at the nested list and listing the remaining fields in meta, we can normalize the data in a way that makes sense for our specific use case.
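Real-world records are not always uniform, and a field listed in meta may be absent from some dictionaries. By default pd.json_normalize raises a KeyError in that case; passing errors='ignore' fills the missing meta value with NaN instead. The sketch below uses a made-up orders list in which the second order has no status.

```python
import pandas as pd

# Hypothetical data: the second order is missing the 'status' key
orders = [
    {'order_id': 5, 'status': 'completed', 'line_items': [{'product_id': 6}]},
    {'order_id': 6, 'line_items': [{'product_id': 8}]},
]

df = pd.json_normalize(
    orders,
    record_path='line_items',
    meta=['order_id', 'status'],
    errors='ignore',   # missing meta fields become NaN instead of raising
)
print(df)
```

Note that this only applies to meta fields: the key named in record_path still has to exist in every record.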

Best Practices

When working with pandas DataFrames and dictionary lists, there are several best practices to keep in mind:

Use Meaningful Column Names

When creating a DataFrame, it’s essential to choose meaningful column names that accurately describe the data. This makes it easier to work with the data and reduces the risk of errors.

df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'status': ['pending', 'completed', 'cancelled']
})

Use Data Types Wisely

When working with DataFrames, it’s crucial to choose the right data types for each column. For example, use int64 for whole numbers and float64 for decimal values, and check the result with df.dtypes, since pandas infers types from the input data.

df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'status': ['pending', 'completed', 'cancelled'],
    'total_cost': [100.0, 200.0, 300.0]
})
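When the inferred types are not the ones you want, astype can convert them explicitly. The following sketch assumes, purely for illustration, that the order IDs arrived as strings and that status has a small fixed set of values, which makes it a candidate for the category dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': ['1', '2', '3'],   # illustrative: IDs that arrived as strings
    'status': ['pending', 'completed', 'cancelled'],
    'total_cost': [100.0, 200.0, 300.0]
})

# Convert several columns at once by passing a column-to-dtype mapping
df = df.astype({'order_id': 'int64', 'status': 'category'})
print(df.dtypes)
```

The same kind of conversion may be worth considering for DataFrames produced by pd.json_normalize, where the inferred dtypes are not always the ones you expect.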

Handle Missing Data

Missing data can be a significant issue when working with DataFrames. To handle missing data, you can use the dropna method or create a new column to indicate whether values are present.

df = pd.DataFrame({
    'order_id': [1, 2, None],
    'status': ['pending', 'completed', 'cancelled']
})

# Drop rows with missing data
df.dropna(subset=['order_id'], inplace=True)
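The other option mentioned above, an indicator column, can be sketched as follows. The column name order_id_missing and the variable clean are illustrative; the sketch also assigns the result of dropna to a new variable rather than modifying the DataFrame in place, so the original rows stay available.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, None],
    'status': ['pending', 'completed', 'cancelled']
})

# Flag rows where order_id is missing instead of discarding them
df['order_id_missing'] = df['order_id'].isna()

# A cleaned copy without the incomplete rows; df itself is unchanged
clean = df.dropna(subset=['order_id'])
print(clean)
```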

By following these best practices and using tools like pd.json_normalize, you can efficiently convert dictionary lists into structured DataFrames for easier manipulation and analysis.

Last modified on 2024-04-21