Pandas - Convert DataFrame to List of Lists
Key Insights
• Converting DataFrames to lists of lists is a fundamental operation for data serialization, API responses, and interfacing with non-pandas libraries that expect nested list structures
• The values.tolist() method provides the fastest conversion, while alternatives like to_numpy().tolist() offer better compatibility with nullable data types and extension arrays
• Understanding when to include headers, how to handle indexes, and performance implications ensures you choose the right approach for your specific use case
Basic Conversion with values.tolist()
The most straightforward method to convert a DataFrame to a list of lists uses the values attribute combined with tolist(). This approach returns only the data values, excluding column names and index.
import pandas as pd
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [999.99, 29.99, 79.99],
    'quantity': [5, 50, 25]
})
# Convert to list of lists
data_list = df.values.tolist()
print(data_list)
# [['Laptop', 999.99, 5], ['Mouse', 29.99, 50], ['Keyboard', 79.99, 25]]
Each inner list represents a row, with elements ordered according to the DataFrame’s column sequence. This method performs well on large datasets because it converts the underlying NumPy array in a single vectorized call rather than iterating row by row.
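If you only need some of the columns, or want them in a different order, slice the DataFrame before converting. A small sketch reusing the DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [999.99, 29.99, 79.99],
    'quantity': [5, 50, 25]
})

# Select only the columns you need, in the order you want them
subset = df[['product', 'quantity']].values.tolist()
print(subset)
# [['Laptop', 5], ['Mouse', 50], ['Keyboard', 25]]
```

The column selection also controls element order within each inner list, since rows follow the column sequence of the sliced frame.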
Using to_numpy() for Better Type Handling
The to_numpy() method, introduced in pandas 0.24.0, provides more robust handling of nullable data types and extension arrays compared to the older values attribute.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': pd.array([25, 30, 35], dtype='Int64'),  # Nullable integer
    'score': [88.5, np.nan, 92.0]
})
# Using to_numpy() preserves nullable types better
data_list = df.to_numpy().tolist()
print(data_list)
# [['Alice', 25, 88.5], ['Bob', 30, nan], [None, 35, 92.0]]
# Compare with values (may produce unexpected results with extension dtypes)
values_list = df.values.tolist()
print(values_list)
The to_numpy() method explicitly converts extension arrays to NumPy arrays before creating lists, ensuring consistent behavior across different pandas data types.
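to_numpy() also accepts an na_value argument (available since pandas 1.0) that replaces every missing marker during the conversion, so NaN, None, and pd.NA all come out as a single sentinel. A sketch reusing the DataFrame above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': pd.array([25, 30, 35], dtype='Int64'),  # Nullable integer
    'score': [88.5, np.nan, 92.0]
})

# na_value normalizes all missing markers to one value during conversion
data_list = df.to_numpy(dtype=object, na_value=None).tolist()
print(data_list)
# [['Alice', 25, 88.5], ['Bob', 30, None], [None, 35, 92.0]]
```

This is the most direct route when downstream code expects a single, consistent null marker instead of a mix of NaN and None.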
Including Column Headers
Many use cases require the column names as the first element in your list structure. You can prepend headers using list concatenation or comprehension.
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'status': ['active', 'pending', 'inactive'],
    'amount': [100.0, 250.5, 75.25]
})
# Method 1: Simple concatenation
headers = [df.columns.tolist()]
data_with_headers = headers + df.values.tolist()
print(data_with_headers)
# [['id', 'status', 'amount'], [1, 'active', 100.0], [2, 'pending', 250.5], [3, 'inactive', 75.25]]
# Method 2: Using list comprehension for more control
result = [df.columns.tolist()] + [row.tolist() for _, row in df.iterrows()]
The first method is more performant for large DataFrames since it avoids iterating through rows. Reserve the iterrows() approach for cases where you need row-level transformation logic.
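The headers-plus-rows shape drops straight into the standard library's csv writer, which is one common reason to build it in the first place. A sketch using an in-memory buffer (the buffer stands in for a real file):

```python
import csv
import io
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'status': ['active', 'pending', 'inactive'],
    'amount': [100.0, 250.5, 75.25]
})

rows = [df.columns.tolist()] + df.values.tolist()

# writerows accepts any iterable of rows, so the nested-list
# structure streams straight into the CSV writer
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Swap the StringIO for an open file handle to write an actual CSV to disk.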
Handling Index Values
If your DataFrame has a meaningful index that should be included in the output, you need to reset it or explicitly include it in the conversion.
import pandas as pd
df = pd.DataFrame({
    'revenue': [10000, 15000, 12000],
    'expenses': [7000, 9000, 8000]
}, index=['Q1', 'Q2', 'Q3'])
# Method 1: Reset index to include it as a column
data_with_index = df.reset_index().values.tolist()
print(data_with_index)
# [['Q1', 10000, 7000], ['Q2', 15000, 9000], ['Q3', 12000, 8000]]
# Method 2: Manual index inclusion
data_manual = [[idx] + row for idx, row in zip(df.index, df.values.tolist())]
print(data_manual)
# Method 3: Include index name in headers
headers = [['quarter'] + df.columns.tolist()]
data_complete = headers + df.reset_index().values.tolist()
print(data_complete)
# [['quarter', 'revenue', 'expenses'], ['Q1', 10000, 7000], ['Q2', 15000, 9000], ['Q3', 12000, 8000]]
The reset_index() approach is cleanest when you want the index treated as a regular column. For multi-level indexes, this method automatically creates separate columns for each index level.
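The multi-level case can be sketched like this (the year/quarter index here is illustrative, not from the examples above):

```python
import pandas as pd

df = pd.DataFrame(
    {'revenue': [10000, 15000, 12000, 11000]},
    index=pd.MultiIndex.from_tuples(
        [(2023, 'Q4'), (2024, 'Q1'), (2024, 'Q2'), (2024, 'Q3')],
        names=['year', 'quarter']
    )
)

# reset_index() turns each index level into its own leading column
data = df.reset_index().values.tolist()
print(data)
# [[2023, 'Q4', 10000], [2024, 'Q1', 15000], [2024, 'Q2', 12000], [2024, 'Q3', 11000]]
```

Each inner list starts with one element per index level, in level order, followed by the data columns.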
Column-Oriented Lists
Sometimes you need data organized by columns rather than rows. This transposed structure is useful for certain plotting libraries and data processing pipelines.
import pandas as pd
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [10, 20, 30, 40],
    'z': [100, 200, 300, 400]
})
# Convert each column to a list
column_lists = [df[col].tolist() for col in df.columns]
print(column_lists)
# [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
# Alternative: Using values.T.tolist()
transposed = df.values.T.tolist()
print(transposed)
# [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
# Create a dictionary mapping column names to lists
column_dict = {col: df[col].tolist() for col in df.columns}
print(column_dict)
# {'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40], 'z': [100, 200, 300, 400]}
The transpose approach (values.T.tolist()) is faster for large DataFrames with many columns, while the dictionary comprehension provides named access to each column’s data.
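If the name-to-list mapping is the end goal, pandas also offers a built-in shortcut: to_dict with orient='list' produces the same dictionary in one call.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [10, 20, 30, 40],
    'z': [100, 200, 300, 400]
})

# orient='list' maps each column name to that column's values as a list
column_dict = df.to_dict(orient='list')
print(column_dict)
# {'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40], 'z': [100, 200, 300, 400]}
```

This is equivalent to the dictionary comprehension above but keeps the intent explicit and the code shorter.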
Performance Considerations
When working with large DataFrames, the conversion method you choose impacts performance significantly.
import pandas as pd
import numpy as np
import time
# Create a large DataFrame
n_rows = 100000
df = pd.DataFrame({
    'col1': np.random.randint(0, 100, n_rows),
    'col2': np.random.random(n_rows),
    'col3': np.random.choice(['A', 'B', 'C'], n_rows)
})
# Benchmark different approaches
start = time.time()
result1 = df.values.tolist()
print(f"values.tolist(): {time.time() - start:.4f}s")
start = time.time()
result2 = df.to_numpy().tolist()
print(f"to_numpy().tolist(): {time.time() - start:.4f}s")
start = time.time()
result3 = [row.tolist() for _, row in df.iterrows()]
print(f"iterrows(): {time.time() - start:.4f}s")
start = time.time()
result4 = df.apply(lambda x: x.tolist(), axis=1).tolist()
print(f"apply(): {time.time() - start:.4f}s")
Avoid iterrows() and apply() for simple conversions—they’re 10-100x slower than vectorized methods. Use values.tolist() for homogeneous data types and to_numpy().tolist() when working with nullable or extension dtypes.
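When per-row iteration is genuinely unavoidable, itertuples() is usually a much faster middle ground than iterrows(), because it skips building a Series object for every row. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [0.5, 0.25, 0.75],
    'col3': ['A', 'B', 'C']
})

# name=None yields plain tuples (no namedtuple construction overhead);
# index=False drops the index value from each row.
# Note: elements may come back as NumPy scalar types rather than
# plain Python numbers.
rows = [list(row) for row in df.itertuples(index=False, name=None)]
```

Reach for this only when each row needs custom logic; for a plain conversion the vectorized methods above remain the right choice.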
Handling Missing Data
Missing values require special attention during conversion, because NaN and None behave differently once the data leaves pandas: NaN is a float that fails equality checks (even against itself), while None is Python’s proper null.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],
    'b': ['x', None, 'z', 'w'],
    'c': [True, False, None, True]
})
# Default conversion preserves NaN
basic_list = df.values.tolist()
print(basic_list)
# [[1.0, 'x', True], [2.0, None, False], [nan, 'z', None], [4.0, 'w', True]]
# Replace NaN with None for JSON compatibility; cast to object first,
# otherwise float columns silently coerce None back to NaN
clean_list = df.astype(object).where(pd.notna(df), None).values.tolist()
print(clean_list)
# [[1.0, 'x', True], [2.0, None, False], [None, 'z', None], [4.0, 'w', True]]
# Fill missing values before conversion
filled_list = df.fillna({'a': 0, 'b': '', 'c': False}).values.tolist()
print(filled_list)
# [[1.0, 'x', True], [2.0, '', False], [0.0, 'z', False], [4.0, 'w', True]]
For JSON serialization or API responses, cast to object and use where(pd.notna(df), None) to convert NaN to None; the object cast matters because numeric columns otherwise coerce None straight back to NaN, and the JSON standard does not recognize NaN values.
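The distinction shows up as soon as the rows reach json.dumps: in strict, standards-compliant mode (allow_nan=False) raw NaN values raise a ValueError, while the None version serializes cleanly. A small sketch:

```python
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})

raw = df.values.tolist()  # NaN survives the conversion
# Cast to object first so float columns can actually hold None
clean = df.astype(object).where(pd.notna(df), None).values.tolist()

try:
    json.dumps(raw, allow_nan=False)
except ValueError as exc:
    print('raw rows rejected:', exc)

print(json.dumps(clean, allow_nan=False))
# [[1.0, "x"], [null, null]]
```

Without allow_nan=False, json.dumps emits the non-standard token NaN instead of raising, which most strict JSON parsers will then reject on the receiving end.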