Use all options of pandas DataFrame info()
Use all options of pandas DataFrame info()
When working with DataFrames, perhaps the most commonly used function to get information about your data is info()
. This function has a few parameters that can be useful, particularily when working with large datasets.
# Import libraries
import pandas as pd
%%bigquery df
# Get large sample data
SELECT *
FROM `bigquery-public-data.hacker_news.stories`
Without parameters
Printing information without any options will indicate a memory usage of 179MB. Also, note that it doesn’t display the number of non-Null values for each column, because this dataset is too large.
# Basic usage without parameters
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 id int64
1 by object
2 score float64
3 time float64
4 time_ts datetime64[ns, UTC]
5 title object
6 url object
7 text object
8 deleted object
9 dead object
10 descendants float64
11 author object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
Real memory usage
Specifying memory_usage='deep'
will enable deep memory introspection, and show real memory usage, which is often much higher than the standard estimation. In this example, it is five times higher than previously estimated.
# Full memory usage
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 id int64
1 by object
2 score float64
3 time float64
4 time_ts datetime64[ns, UTC]
5 title object
6 url object
7 text object
8 deleted object
9 dead object
10 descendants float64
11 author object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 1007.3 MB
Force count of Null values
On large datasets, counting Null values in each columns is deactivated, to avoid lengthy calculation. To force the display, set show_counts=True
.
# Force Null values counting
df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1959809 non-null int64
1 by 1841269 non-null object
2 score 1841269 non-null float64
3 time 1934088 non-null float64
4 time_ts 1934088 non-null datetime64[ns, UTC]
5 title 1841267 non-null object
6 url 1839757 non-null object
7 text 1425167 non-null object
8 deleted 92818 non-null object
9 dead 394344 non-null object
10 descendants 1741598 non-null float64
11 author 1841269 non-null object
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB
Compact mode
If you don’t want to display the full summary with details about each columns, use the compact mode with verbose=False
.
# Non verbose mode
df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1959809 entries, 0 to 1959808
Columns: 12 entries, id to author
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(7)
memory usage: 179.4+ MB