Converting a Dictionary List to a Pandas DataFrame
When working with data in Python, it’s common to encounter dictionary lists that need to be converted into structured dataframes for easier manipulation and analysis. In this article, we’ll explore how to convert a dictionary list into a pandas DataFrame using the pd.json_normalize function.
Understanding Dictionary Lists
A dictionary list is a list of dictionaries in which each dictionary represents one record (row) of data. The keys of each dictionary become column names, and the values become the entries in those columns. A value can itself be a list or a dictionary, which is what makes the data nested. For example, given a dictionary list like this:
orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6, 'name': 'headphone'}, {'product_id': 7, 'name': 'airbuds'}]
    },
    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8, 'name': 'smartwatch'}, {'product_id': 9, 'name': 'smartphone'}]
    },
]
We can see that each dictionary in the list has its own order_id and status, while the line_items key holds another list of dictionaries, where each item contains a product_id and a name.
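To see why this structure needs special handling, here is a minimal sketch that passes the same orderlist straight to the pd.DataFrame constructor. The variable name naive_df is made up for illustration. The constructor builds one row per order and leaves each line_items list sitting inside a single cell.

```python
import pandas as pd

orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6, 'name': 'headphone'},
                       {'product_id': 7, 'name': 'airbuds'}],
    },
    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8, 'name': 'smartwatch'},
                       {'product_id': 9, 'name': 'smartphone'}],
    },
]

# One row per order; the nested list is kept as a Python list in one cell.
naive_df = pd.DataFrame(orderlist)
print(naive_df.shape)                       # (2, 3)
print(type(naive_df.loc[0, 'line_items']))  # <class 'list'>
```

Getting one row per line item requires expanding that nested list, which is the job pd.json_normalize is suited for.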
Working with Pandas DataFrames
Pandas is a powerful library for data manipulation and analysis in Python. A pandas DataFrame is a two-dimensional labeled data structure that provides efficient storage and retrieval of data.
In this section, we’ll use the pd.json_normalize function to convert the dictionary list above into a pandas DataFrame.
Using pd.json_normalize
The pd.json_normalize function takes a dictionary or a list of dictionaries as input, where each dictionary represents a record. It flattens nested structures into a table: nested dictionaries become additional columns, and a nested list of records can be expanded into one row per record.
Here’s how you can use pd.json_normalize to convert your dictionary list into a pandas DataFrame:
import pandas as pd
orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6, 'name': 'headphone'}, {'product_id': 7, 'name': 'airbuds'}]
    },
    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8, 'name': 'smartwatch'}, {'product_id': 9, 'name': 'smartphone'}]
    },
]
df = pd.json_normalize(orderlist, ['line_items'], ['order_id', 'status'])
print(df)
Output:
  product_id        name order_id     status
0          6   headphone        5  completed
1          7     airbuds        5  completed
2          8  smartwatch        6    pending
3          9  smartphone        6    pending
In this code, pd.json_normalize takes three arguments:
- The first argument is the dictionary list that we want to normalize.
- The second argument, record_path, is the key (or path of keys) that leads to the nested list of records. Each element of that list becomes one row in the result. In our case, it’s ['line_items'].
- The third argument, meta, is a list of fields from the enclosing dictionary that are repeated alongside every record as extra columns. Here, we’re using ['order_id', 'status'].
By providing these arguments, pd.json_normalize can convert our dictionary list into a flat table with one row per line item and the order fields attached to each row.
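The same call can also be written with keyword arguments, which makes the role of each argument explicit. The sketch below additionally uses the optional record_prefix parameter to label the columns that come from the nested records. The 'item_' prefix and the df_kw name are illustrative choices, not part of the original example.

```python
import pandas as pd

orderlist = [
    {
        'order_id': 5,
        'status': 'completed',
        'line_items': [{'product_id': 6, 'name': 'headphone'},
                       {'product_id': 7, 'name': 'airbuds'}],
    },
    {
        'order_id': 6,
        'status': 'pending',
        'line_items': [{'product_id': 8, 'name': 'smartwatch'},
                       {'product_id': 9, 'name': 'smartphone'}],
    },
]

df_kw = pd.json_normalize(
    orderlist,
    record_path='line_items',      # the nested list to expand into rows
    meta=['order_id', 'status'],   # parent fields repeated on every row
    record_prefix='item_',         # assumed prefix, applied to record columns only
)
print(df_kw.columns.tolist())
# ['item_product_id', 'item_name', 'order_id', 'status']
```

A prefix like this can help when a record field and a meta field share the same name, since pandas raises an error on conflicting column names.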
Handling Nested Data Structures
One common challenge when working with nested data structures like dictionaries is handling the data that’s contained within them. In some cases, you might need to flatten an entire branch of the tree or normalize it in a way that makes sense for your specific use case.
Here are a few ways you can handle nested data structures using pd.json_normalize:
Flattening an Entire Branch
If you want to flatten an entire branch of the tree (i.e., all the values within a particular key) and that branch is a nested dictionary, pd.json_normalize can expand it into one column per inner key.
import pandas as pd
data = [
    {
        'id': 1,
        'name': 'John',
        'address': {
            'street': '123 Main St',
            'city': 'Anytown',
            'state': 'CA'
        }
    },
    {
        'id': 2,
        'name': 'Jane',
        'address': {
            'street': '456 Elm St',
            'city': 'Othertown',
            'state': 'NY'
        }
    }
]
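# 'address' holds a dictionary, not a list, so record_path is not needed here.
# (Passing ['address'] as record_path raises an error in recent pandas versions.)
# Nested keys become dotted column names such as address.street.
df = pd.json_normalize(data)
print(df)
This code would output:
  id  name address.street address.city address.state
0  1  John    123 Main St      Anytown            CA
1  2  Jane     456 Elm St    Othertown            NY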
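When no record_path is given, two optional parameters control how nested dictionaries are flattened: sep sets the separator used in the generated column names, and max_level limits how many levels are expanded. The following is a small sketch of both; the variable names flat and shallow are made up for illustration.

```python
import pandas as pd

data = [
    {'id': 1, 'name': 'John',
     'address': {'street': '123 Main St', 'city': 'Anytown', 'state': 'CA'}},
    {'id': 2, 'name': 'Jane',
     'address': {'street': '456 Elm St', 'city': 'Othertown', 'state': 'NY'}},
]

# Join outer and inner keys with an underscore instead of the default dot.
flat = pd.json_normalize(data, sep='_')
print(flat.columns.tolist())
# ['id', 'name', 'address_street', 'address_city', 'address_state']

# max_level=0 stops before the nested dictionary, so 'address' stays a dict.
shallow = pd.json_normalize(data, max_level=0)
print(shallow.columns.tolist())
# ['id', 'name', 'address']
```

Underscore-separated names are convenient if you later want attribute-style access such as flat.address_city.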
Normalizing Multiple Branches
If a record contains more than one nested branch, such as a nested dictionary and a nested list, you can flatten both in a single call.
import pandas as pd
data = [
    {
        'id': 1,
        'name': 'John',
        'address': {
            'street': '123 Main St',
            'city': 'Anytown'
        },
        'order_items': [
            {'product_id': 101, 'quantity': 2},
            {'product_id': 102, 'quantity': 1}
        ]
    },
    {
        'id': 2,
        'name': 'Jane',
        'address': {
            'street': '456 Elm St',
            'city': 'Othertown'
        },
        'order_items': [
            {'product_id': 201, 'quantity': 3},
            {'product_id': 202, 'quantity': 2}
        ]
    }
]
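df = pd.json_normalize(
    data,
    record_path='order_items',
    meta=['id', 'name', ['address', 'street'], ['address', 'city']]
)
print(df)
This code would output:
  product_id quantity id  name address.street address.city
0        101        2  1  John    123 Main St      Anytown
1        102        1  1  John    123 Main St      Anytown
2        201        3  2  Jane     456 Elm St    Othertown
3        202        2  2  Jane     456 Elm St    Othertown
By using record_path for the nested list and meta paths such as ['address', 'street'] for the nested dictionary, we get one row per order item with the customer and address fields repeated alongside it. Note that a list passed as record_path, such as ['address', 'order_items'], is read as a single path (order_items inside address), not as two separate branches, so it would fail on this data.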
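Real data often contains records where a nested field is missing. As a hedged sketch, the example below assumes a second customer with no address key and passes errors='ignore' so that the missing metadata becomes NaN instead of raising a KeyError. The sample records and the df_partial name are invented for illustration.

```python
import pandas as pd

data = [
    {
        'id': 1,
        'name': 'John',
        'address': {'city': 'Anytown'},
        'order_items': [{'product_id': 101, 'quantity': 2}],
    },
    {
        'id': 2,
        'name': 'Jane',
        # no 'address' key on this record (assumed for the example)
        'order_items': [{'product_id': 201, 'quantity': 3}],
    },
]

df_partial = pd.json_normalize(
    data,
    record_path='order_items',
    meta=['id', 'name', ['address', 'city']],
    errors='ignore',  # fill missing meta fields with NaN instead of raising
)
print(df_partial.columns.tolist())
# ['product_id', 'quantity', 'id', 'name', 'address.city']
print(df_partial['address.city'].isna().tolist())
# [False, True]
```

This option only covers missing meta fields. The key named in record_path is still expected on every record.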
Best Practices
When working with pandas DataFrames and dictionary lists, there are several best practices to keep in mind:
Use Meaningful Column Names
When creating a DataFrame, it’s essential to choose meaningful column names that accurately describe the data. This makes it easier to work with the data and reduces the risk of errors.
df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'status': ['pending', 'completed', 'cancelled']
})
Use Data Types Wisely
When working with DataFrames, it’s crucial to choose the right data types for each column. For example, if you’re working with numbers, use the int64 or float64 data type.
df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'status': ['pending', 'completed', 'cancelled'],
    'total_cost': [100.0, 200.0, 300.0]
})
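Data types deserve a second look after pd.json_normalize in particular: in the pandas versions I am aware of, columns built from meta come back with the generic object dtype even when the values are integers. The sketch below checks the dtype and converts it with astype; the two-order sample and the name typed are assumptions made for the example.

```python
import pandas as pd

orderlist = [
    {'order_id': 5, 'status': 'completed',
     'line_items': [{'product_id': 6, 'name': 'headphone'}]},
    {'order_id': 6, 'status': 'pending',
     'line_items': [{'product_id': 8, 'name': 'smartwatch'}]},
]

df = pd.json_normalize(orderlist, record_path='line_items',
                       meta=['order_id', 'status'])
print(df['order_id'].dtype)   # object

# Convert columns explicitly once the table is flat.
typed = df.astype({'order_id': 'int64', 'status': 'category'})
print(typed['order_id'].dtype)  # int64
```

Using a category dtype for status is optional; it tends to pay off when a column has only a few distinct values repeated many times.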
Handle Missing Data
Missing data can be a significant issue when working with DataFrames. To handle missing data, you can use the dropna method or create a new column to indicate whether values are present.
df = pd.DataFrame({
    'order_id': [1, 2, None],  # None is stored as NaN, so this column becomes float64
    'status': ['pending', 'completed', 'cancelled']
})
# Drop rows with missing data
df = df.dropna(subset=['order_id'])
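The second option mentioned above, an indicator column, can be sketched as follows. The column name has_order_id and the variable cleaned are hypothetical names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, None],
    'status': ['pending', 'completed', 'cancelled'],
})

# Flag which rows have an order_id before deciding what to drop.
df['has_order_id'] = df['order_id'].notna()
print(df['has_order_id'].tolist())  # [True, True, False]

# Keep the original frame and work on a filtered copy.
cleaned = df.dropna(subset=['order_id'])
print(len(cleaned))  # 2
```

Keeping the flag lets you report how many rows were incomplete, while dropna on a copy leaves the original data available for inspection.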
By following these best practices and using tools like pd.json_normalize, you can efficiently convert dictionary lists into structured DataFrames for easier manipulation and analysis.
Last modified on 2024-04-21