- class EncoderDataFrame(*args, **kwargs)[source]
Bases: pandas.core.frame.DataFrame
- Attributes
  - T
  - at: Access a single value for a row/column label pair.
  - attrs: Dictionary of global attributes of this dataset.
  - axes: Return a list representing the axes of the DataFrame.
  - columns: The column labels of the DataFrame.
  - dtypes: Return the dtypes in the DataFrame.
  - empty: Indicator whether DataFrame is empty.
  - flags: Get the properties associated with this pandas object.
  - iat: Access a single value for a row/column pair by integer position.
  - iloc: Purely integer-location based indexing for selection by position.
  - index: The index (row labels) of the DataFrame.
  - loc: Access a group of rows and columns by label(s) or a boolean array.
  - ndim: Return an int representing the number of axes / array dimensions.
  - shape: Return a tuple representing the dimensionality of the DataFrame.
  - size: Return an int representing the number of elements in this object.
  - style: Returns a Styler object.
  - values: Return a Numpy representation of the DataFrame.
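Because EncoderDataFrame subclasses pandas.DataFrame, these attributes behave exactly as on a plain DataFrame. A quick illustration with plain pandas (the data here is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)        # (3, 2): (rows, columns)
print(df.ndim)         # 2 axes
print(df.size)         # 6 elements
print(df.iloc[0, 1])   # 4.0: single cell by integer position
print(df.at[2, "a"])   # 3: single cell by row/column label
```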
Methods
- abs(): Return a Series/DataFrame with absolute numeric value of each element.
- add(other[, axis, level, fill_value]): Get Addition of dataframe and other, element-wise (binary operator add).
- add_prefix(prefix): Prefix labels with string prefix.
- add_suffix(suffix): Suffix labels with string suffix.
- agg([func, axis]): Aggregate using one or more operations over the specified axis.
- aggregate([func, axis]): Aggregate using one or more operations over the specified axis.
- align(other[, join, axis, level, copy, ...]): Align two objects on their axes with the specified join method.
- all([axis, bool_only, skipna, level]): Return whether all elements are True, potentially over an axis.
- any([axis, bool_only, skipna, level]): Return whether any element is True, potentially over an axis.
- append(other[, ignore_index, ...]): Append rows of other to the end of caller, returning a new object.
- apply(func[, axis, raw, result_type, args]): Apply a function along an axis of the DataFrame.
- applymap(func[, na_action]): Apply a function to a Dataframe elementwise.
- asfreq(freq[, method, how, normalize, ...]): Convert time series to specified frequency.
- asof(where[, subset]): Return the last row(s) without any NaNs before where.
- assign(**kwargs): Assign new columns to a DataFrame.
- astype(dtype[, copy, errors]): Cast a pandas object to a specified dtype dtype.
- at_time(time[, asof, axis]): Select values at particular time of day (e.g., 9:30AM).
- backfill([axis, inplace, limit, downcast]): Synonym for DataFrame.fillna() with method='bfill'.
- between_time(start_time, end_time[, ...]): Select values between particular times of the day (e.g., 9:00-9:30 AM).
- bfill([axis, inplace, limit, downcast]): Synonym for DataFrame.fillna() with method='bfill'.
- bool(): Return the bool of a single element Series or DataFrame.
- boxplot([column, by, ax, fontsize, rot, ...]): Make a box plot from DataFrame columns.
- clip([lower, upper, axis, inplace]): Trim values at input threshold(s).
- combine(other, func[, fill_value, overwrite]): Perform column-wise combine with another DataFrame.
- combine_first(other): Update null elements with value in the same location in other.
- compare(other[, align_axis, keep_shape, ...]): Compare to another DataFrame and show the differences.
- convert_dtypes([infer_objects, ...]): Convert columns to best possible dtypes using dtypes supporting pd.NA.
- copy([deep]): Make a copy of this object's indices and data.
- corr([method, min_periods]): Compute pairwise correlation of columns, excluding NA/null values.
- corrwith(other[, axis, drop, method]): Compute pairwise correlation.
- count([axis, level, numeric_only]): Count non-NA cells for each column or row.
- cov([min_periods, ddof]): Compute pairwise covariance of columns, excluding NA/null values.
- cummax([axis, skipna]): Return cumulative maximum over a DataFrame or Series axis.
- cummin([axis, skipna]): Return cumulative minimum over a DataFrame or Series axis.
- cumprod([axis, skipna]): Return cumulative product over a DataFrame or Series axis.
- cumsum([axis, skipna]): Return cumulative sum over a DataFrame or Series axis.
- describe([percentiles, include, exclude, ...]): Generate descriptive statistics.
- diff([periods, axis]): First discrete difference of element.
- div(other[, axis, level, fill_value]): Get Floating division of dataframe and other, element-wise (binary operator truediv).
- divide(other[, axis, level, fill_value]): Get Floating division of dataframe and other, element-wise (binary operator truediv).
- dot(other): Compute the matrix multiplication between the DataFrame and other.
- drop([labels, axis, index, columns, level, ...]): Drop specified labels from rows or columns.
- drop_duplicates([subset, keep, inplace, ...]): Return DataFrame with duplicate rows removed.
- droplevel(level[, axis]): Return Series/DataFrame with requested index / column level(s) removed.
- dropna([axis, how, thresh, subset, inplace]): Remove missing values.
- duplicated([subset, keep]): Return boolean Series denoting duplicate rows.
- eq(other[, axis, level]): Get Equal to of dataframe and other, element-wise (binary operator eq).
- equals(other): Test whether two objects contain the same elements.
- eval(expr[, inplace]): Evaluate a string describing operations on DataFrame columns.
- ewm([com, span, halflife, alpha, ...]): Provide exponential weighted (EW) functions.
- expanding([min_periods, center, axis, method]): Provide expanding transformations.
- explode(column[, ignore_index]): Transform each element of a list-like to a row, replicating index values.
- ffill([axis, inplace, limit, downcast]): Synonym for DataFrame.fillna() with method='ffill'.
- fillna([value, method, axis, inplace, ...]): Fill NA/NaN values using the specified method.
- filter([items, like, regex, axis]): Subset the dataframe rows or columns according to the specified index labels.
- first(offset): Select initial periods of time series data based on a date offset.
- first_valid_index(): Return index for first non-NA value or None, if no non-NA value is found.
- floordiv(other[, axis, level, fill_value]): Get Integer division of dataframe and other, element-wise (binary operator floordiv).
- from_dict(data[, orient, dtype, columns]): Construct DataFrame from dict of array-like or dicts.
- from_records(data[, index, exclude, ...]): Convert structured or record ndarray to DataFrame.
- ge(other[, axis, level]): Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
- get(key[, default]): Get item from object for given key (ex: DataFrame column).
- groupby([by, axis, level, as_index, sort, ...]): Group DataFrame using a mapper or by a Series of columns.
- gt(other[, axis, level]): Get Greater than of dataframe and other, element-wise (binary operator gt).
- head([n]): Return the first n rows.
- hist([column, by, grid, xlabelsize, xrot, ...]): Make a histogram of the DataFrame's columns.
- idxmax([axis, skipna]): Return index of first occurrence of maximum over requested axis.
- idxmin([axis, skipna]): Return index of first occurrence of minimum over requested axis.
- infer_objects(): Attempt to infer better dtypes for object columns.
- info([verbose, buf, max_cols, memory_usage, ...]): Print a concise summary of a DataFrame.
- insert(loc, column, value[, allow_duplicates]): Insert column into DataFrame at specified location.
- interpolate([method, axis, limit, inplace, ...]): Fill NaN values using an interpolation method.
- isin(values): Whether each element in the DataFrame is contained in values.
- isna(): Detect missing values.
- isnull(): Detect missing values.
- items(): Iterate over (column name, Series) pairs.
- iteritems(): Iterate over (column name, Series) pairs.
- iterrows(): Iterate over DataFrame rows as (index, Series) pairs.
- itertuples([index, name]): Iterate over DataFrame rows as namedtuples.
- join(other[, on, how, lsuffix, rsuffix, sort]): Join columns of another DataFrame.
- keys(): Get the 'info axis' (see Indexing for more).
- kurt([axis, skipna, level, numeric_only]): Return unbiased kurtosis over requested axis.
- kurtosis([axis, skipna, level, numeric_only]): Return unbiased kurtosis over requested axis.
- last(offset): Select final periods of time series data based on a date offset.
- last_valid_index(): Return index for last non-NA value or None, if no non-NA value is found.
- le(other[, axis, level]): Get Less than or equal to of dataframe and other, element-wise (binary operator le).
- lookup(row_labels, col_labels): Label-based "fancy indexing" function for DataFrame.
- lt(other[, axis, level]): Get Less than of dataframe and other, element-wise (binary operator lt).
- mad([axis, skipna, level]): Return the mean absolute deviation of the values over the requested axis.
- mask(cond[, other, inplace, axis, level, ...]): Replace values where the condition is True.
- max([axis, skipna, level, numeric_only]): Return the maximum of the values over the requested axis.
- mean([axis, skipna, level, numeric_only]): Return the mean of the values over the requested axis.
- median([axis, skipna, level, numeric_only]): Return the median of the values over the requested axis.
- melt([id_vars, value_vars, var_name, ...]): Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
- memory_usage([index, deep]): Return the memory usage of each column in bytes.
- merge(right[, how, on, left_on, right_on, ...]): Merge DataFrame or named Series objects with a database-style join.
- min([axis, skipna, level, numeric_only]): Return the minimum of the values over the requested axis.
- mod(other[, axis, level, fill_value]): Get Modulo of dataframe and other, element-wise (binary operator mod).
- mode([axis, numeric_only, dropna]): Get the mode(s) of each element along the selected axis.
- mul(other[, axis, level, fill_value]): Get Multiplication of dataframe and other, element-wise (binary operator mul).
- multiply(other[, axis, level, fill_value]): Get Multiplication of dataframe and other, element-wise (binary operator mul).
- ne(other[, axis, level]): Get Not equal to of dataframe and other, element-wise (binary operator ne).
- nlargest(n, columns[, keep]): Return the first n rows ordered by columns in descending order.
- notna(): Detect existing (non-missing) values.
- notnull(): Detect existing (non-missing) values.
- nsmallest(n, columns[, keep]): Return the first n rows ordered by columns in ascending order.
- nunique([axis, dropna]): Count number of distinct elements in specified axis.
- pad([axis, inplace, limit, downcast]): Synonym for DataFrame.fillna() with method='ffill'.
- pct_change([periods, fill_method, limit, freq]): Percentage change between the current and a prior element.
- pipe(func, *args, **kwargs): Apply func(self, *args, **kwargs).
- pivot([index, columns, values]): Return reshaped DataFrame organized by given index / column values.
- pivot_table([values, index, columns, ...]): Create a spreadsheet-style pivot table as a DataFrame.
- plot: alias of pandas.plotting._core.PlotAccessor.
- pop(item): Return item and drop from frame.
- pow(other[, axis, level, fill_value]): Get Exponential power of dataframe and other, element-wise (binary operator pow).
- prod([axis, skipna, level, numeric_only, ...]): Return the product of the values over the requested axis.
- product([axis, skipna, level, numeric_only, ...]): Return the product of the values over the requested axis.
- quantile([q, axis, numeric_only, interpolation]): Return values at the given quantile over requested axis.
- query(expr[, inplace]): Query the columns of a DataFrame with a boolean expression.
- radd(other[, axis, level, fill_value]): Get Addition of dataframe and other, element-wise (binary operator radd).
- rank([axis, method, numeric_only, ...]): Compute numerical data ranks (1 through n) along axis.
- rdiv(other[, axis, level, fill_value]): Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
- reindex([labels, index, columns, axis, ...]): Conform Series/DataFrame to new index with optional filling logic.
- reindex_like(other[, method, copy, limit, ...]): Return an object with matching indices as other object.
- rename([mapper, index, columns, axis, copy, ...]): Alter axes labels.
- rename_axis([mapper, index, columns, axis, ...]): Set the name of the axis for the index or columns.
- reorder_levels(order[, axis]): Rearrange index levels using input order.
- replace([to_replace, value, inplace, limit, ...]): Replace values given in to_replace with value.
- resample(rule[, axis, closed, label, ...]): Resample time-series data.
- reset_index([level, drop, inplace, ...]): Reset the index, or a level of it.
- rfloordiv(other[, axis, level, fill_value]): Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
- rmod(other[, axis, level, fill_value]): Get Modulo of dataframe and other, element-wise (binary operator rmod).
- rmul(other[, axis, level, fill_value]): Get Multiplication of dataframe and other, element-wise (binary operator rmul).
- rolling(window[, min_periods, center, ...]): Provide rolling window calculations.
- round([decimals]): Round a DataFrame to a variable number of decimal places.
- rpow(other[, axis, level, fill_value]): Get Exponential power of dataframe and other, element-wise (binary operator rpow).
- rsub(other[, axis, level, fill_value]): Get Subtraction of dataframe and other, element-wise (binary operator rsub).
- rtruediv(other[, axis, level, fill_value]): Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
- sample([n, frac, replace, weights, ...]): Return a random sample of items from an axis of object.
- select_dtypes([include, exclude]): Return a subset of the DataFrame's columns based on the column dtypes.
- sem([axis, skipna, level, ddof, numeric_only]): Return unbiased standard error of the mean over requested axis.
- set_axis(labels[, axis, inplace]): Assign desired index to given axis.
- set_flags(*[, copy, allows_duplicate_labels]): Return a new object with updated flags.
- set_index(keys[, drop, append, inplace, ...]): Set the DataFrame index using existing columns.
- shift([periods, freq, axis, fill_value]): Shift index by desired number of periods with an optional time freq.
- skew([axis, skipna, level, numeric_only]): Return unbiased skew over requested axis.
- slice_shift([periods, axis]): Equivalent to shift without copying data.
- sort_index([axis, level, ascending, ...]): Sort object by labels (along an axis).
- sort_values(by[, axis, ascending, inplace, ...]): Sort by the values along either axis.
- sparse: alias of pandas.core.arrays.sparse.accessor.SparseFrameAccessor.
- squeeze([axis]): Squeeze 1 dimensional axis objects into scalars.
- stack([level, dropna]): Stack the prescribed level(s) from columns to index.
- std([axis, skipna, level, ddof, numeric_only]): Return sample standard deviation over requested axis.
- sub(other[, axis, level, fill_value]): Get Subtraction of dataframe and other, element-wise (binary operator sub).
- subtract(other[, axis, level, fill_value]): Get Subtraction of dataframe and other, element-wise (binary operator sub).
- sum([axis, skipna, level, numeric_only, ...]): Return the sum of the values over the requested axis.
- swap([likelihood]): Performs random swapping of data.
- swapaxes(axis1, axis2[, copy]): Interchange axes and swap values axes appropriately.
- swaplevel([i, j, axis]): Swap levels i and j in a MultiIndex.
- tail([n]): Return the last n rows.
- take(indices[, axis, is_copy]): Return the elements in the given positional indices along an axis.
- to_clipboard([excel, sep]): Copy object to the system clipboard.
- to_csv([path_or_buf, sep, na_rep, ...]): Write object to a comma-separated values (csv) file.
- to_dict([orient, into]): Convert the DataFrame to a dictionary.
- to_excel(excel_writer[, sheet_name, na_rep, ...]): Write object to an Excel sheet.
- to_feather(path, **kwargs): Write a DataFrame to the binary Feather format.
- to_gbq(destination_table[, project_id, ...]): Write a DataFrame to a Google BigQuery table.
- to_hdf(path_or_buf, key[, mode, complevel, ...]): Write the contained data to an HDF5 file using HDFStore.
- to_html([buf, columns, col_space, header, ...]): Render a DataFrame as an HTML table.
- to_json([path_or_buf, orient, date_format, ...]): Convert the object to a JSON string.
- to_latex([buf, columns, col_space, header, ...]): Render object to a LaTeX tabular, longtable, or nested table/tabular.
- to_markdown([buf, mode, index, storage_options]): Print DataFrame in Markdown-friendly format.
- to_numpy([dtype, copy, na_value]): Convert the DataFrame to a NumPy array.
- to_parquet([path, engine, compression, ...]): Write a DataFrame to the binary parquet format.
- to_period([freq, axis, copy]): Convert DataFrame from DatetimeIndex to PeriodIndex.
- to_pickle(path[, compression, protocol, ...]): Pickle (serialize) object to file.
- to_records([index, column_dtypes, index_dtypes]): Convert DataFrame to a NumPy record array.
- to_sql(name, con[, schema, if_exists, ...]): Write records stored in a DataFrame to a SQL database.
- to_stata(path[, convert_dates, write_index, ...]): Export DataFrame object to Stata dta format.
- to_string([buf, columns, col_space, header, ...]): Render a DataFrame to a console-friendly tabular output.
- to_timestamp([freq, how, axis, copy]): Cast to DatetimeIndex of timestamps, at beginning of period.
- to_xarray(): Return an xarray object from the pandas object.
- to_xml([path_or_buffer, index, root_name, ...]): Render a DataFrame to an XML document.
- transform(func[, axis]): Call func on self producing a DataFrame with transformed values.
- transpose(*args[, copy]): Transpose index and columns.
- truediv(other[, axis, level, fill_value]): Get Floating division of dataframe and other, element-wise (binary operator truediv).
- truncate([before, after, axis, copy]): Truncate a Series or DataFrame before and after some index value.
- tshift([periods, freq, axis]): Shift the time index, using the index's frequency if available.
- tz_convert(tz[, axis, level, copy]): Convert tz-aware axis to target time zone.
- tz_localize(tz[, axis, level, copy, ...]): Localize tz-naive index of a Series or DataFrame to target time zone.
- unstack([level, fill_value]): Pivot a level of the (necessarily hierarchical) index labels.
- update(other[, join, overwrite, ...]): Modify in place using non-NA values from another DataFrame.
- value_counts([subset, normalize, sort, ...]): Return a Series containing counts of unique rows in the DataFrame.
- var([axis, skipna, level, ddof, numeric_only]): Return unbiased variance over requested axis.
- where(cond[, other, inplace, axis, level, ...]): Replace values where the condition is False.
- xs(key[, axis, level, drop_level]): Return cross-section from the Series/DataFrame.
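The one method above that is not inherited from pandas.DataFrame is swap, summarized only as "Performs random swapping of data". The sketch below is a hypothetical illustration of those semantics in plain pandas/numpy, not the class's actual implementation; the function name swap_sketch and the per-column donor-draw strategy are assumptions:

```python
import numpy as np
import pandas as pd

def swap_sketch(df: pd.DataFrame, likelihood: float = 0.15, seed: int = 0) -> pd.DataFrame:
    """Hypothetical sketch: replace each cell, with probability `likelihood`,
    by a value drawn (with replacement) from elsewhere in the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = rng.random(len(out)) < likelihood              # rows to swap
        donors = rng.integers(0, len(out), size=int(mask.sum()))  # donor rows
        out.loc[mask, col] = df[col].to_numpy()[donors]
    return out

df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})
swapped = swap_sketch(df, likelihood=0.3)
# Shape and columns are preserved; only cell values move within a column.
```

This kind of corruption is what a denoising autoencoder trains against: the swapped frame keeps each column's marginal distribution while breaking row-wise structure at the chosen rate.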
- abs()[source]
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
- Returns
  - abs
    Series/DataFrame containing the absolute value of each element.
See also
  numpy.absolute: Calculate the absolute value element-wise.
Notes
For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{a^2 + b^2}\).
Examples
Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
   a   b    c
1  5  20   50
0  4  10  100
2  6  30  -30
3  7  40  -50
- add(other, axis='columns', level=None, fill_value=None)[source]
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
  - other: scalar, sequence, Series, or DataFrame
    Any single or multiple element data structure, or list-like object.
  - axis: {0 or ‘index’, 1 or ‘columns’}
    Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
  - level: int or label
    Broadcast across a level, matching Index values on the passed MultiIndex level.
  - fill_value: float or None, default None
    Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
  - DataFrame
    Result of the arithmetic operation.
See also
  DataFrame.add: Add DataFrames.
  DataFrame.sub: Subtract DataFrames.
  DataFrame.mul: Multiply DataFrames.
  DataFrame.div: Divide DataFrames (float division).
  DataFrame.truediv: Divide DataFrames (float division).
  DataFrame.floordiv: Divide DataFrames (integer division).
  DataFrame.mod: Calculate modulo (remainder after division).
  DataFrame.pow: Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- add_prefix(prefix)[source]
Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
- Parameters
  - prefix: str
    The string to add before each label.
- Returns
  - Series or DataFrame
    New Series or DataFrame with updated labels.
See also
  Series.add_suffix: Suffix row labels with string suffix.
  DataFrame.add_suffix: Suffix column labels with string suffix.
Examples
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_prefix('col_')
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
- add_suffix(suffix)[source]
Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
- Parameters
  - suffix: str
    The string to add after each label.
- Returns
  - Series or DataFrame
    New Series or DataFrame with updated labels.
See also
  Series.add_prefix: Prefix row labels with string prefix.
  DataFrame.add_prefix: Prefix column labels with string prefix.
Examples
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6

>>> df.add_suffix('_col')
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6
- agg(func=None, axis=0, *args, **kwargs)[source]
Aggregate using one or more operations over the specified axis.
- Parameters
  - func: function, str, list or dict
    Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
    Accepted combinations are:
    - function
    - string function name
    - list of functions and/or function names, e.g. [np.sum, 'mean']
    - dict of axis labels -> functions, function names or list of such.
  - axis: {0 or ‘index’, 1 or ‘columns’}, default 0
    If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
  - *args
    Positional arguments to pass to func.
  - **kwargs
    Keyword arguments to pass to func.
- Returns
  - scalar, Series or DataFrame
    The return can be:
    - scalar : when Series.agg is called with single function
    - Series : when DataFrame.agg is called with a single function
    - DataFrame : when DataFrame.agg is called with several functions
See also
  DataFrame.apply: Perform any type of operations.
  DataFrame.transform: Perform transformation type operations.
  core.groupby.GroupBy: Perform operations over groups.
  core.resample.Resampler: Perform operations over resampled bins.
  core.window.Rolling: Perform operations over rolling window.
  core.window.Expanding: Perform operations over expanding window.
  core.window.ExponentialMovingWindow: Perform operation over exponential weighted window.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Aggregate different functions over the columns and rename the index of the resulting DataFrame.

>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
- aggregate(func=None, axis=0, *args, **kwargs)[source]
Aggregate using one or more operations over the specified axis.
- Parameters
  - func: function, str, list or dict
    Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
    Accepted combinations are:
    - function
    - string function name
    - list of functions and/or function names, e.g. [np.sum, 'mean']
    - dict of axis labels -> functions, function names or list of such.
  - axis: {0 or ‘index’, 1 or ‘columns’}, default 0
    If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
  - *args
    Positional arguments to pass to func.
  - **kwargs
    Keyword arguments to pass to func.
- Returns
  - scalar, Series or DataFrame
    The return can be:
    - scalar : when Series.agg is called with single function
    - Series : when DataFrame.agg is called with a single function
    - DataFrame : when DataFrame.agg is called with several functions
See also
  DataFrame.apply: Perform any type of operations.
  DataFrame.transform: Perform transformation type operations.
  core.groupby.GroupBy: Perform operations over groups.
  core.resample.Resampler: Perform operations over resampled bins.
  core.window.Rolling: Perform operations over rolling window.
  core.window.Expanding: Perform operations over expanding window.
  core.window.ExponentialMovingWindow: Perform operation over exponential weighted window.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Aggregate different functions over the columns and rename the index of the resulting DataFrame.

>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
- align(other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)[source]
Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
- Parameters
  - other: DataFrame or Series
  - join: {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
  - axis: allowed axis of the other object, default None
    Align on index (0), columns (1), or both (None).
  - level: int or level name, default None
    Broadcast across a level, matching Index values on the passed MultiIndex level.
  - copy: bool, default True
    Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
  - fill_value: scalar, default np.NaN
    Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
  - method: {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
    Method to use for filling holes in reindexed Series:
    - pad / ffill: propagate last valid observation forward to next valid.
    - backfill / bfill: use NEXT valid observation to fill gap.
  - limit: int, default None
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
  - fill_axis: {0 or ‘index’, 1 or ‘columns’}, default 0
    Filling axis, method and limit.
  - broadcast_axis: {0 or ‘index’, 1 or ‘columns’}, default None
    Broadcast values along this axis, if aligning two objects of different dimensions.
- Returns
  - (left, right): (DataFrame, type of other)
    Aligned objects.
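A minimal sketch of align with the default outer join on the row axis (the data here is made up):

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])

# Default outer join: both results are reindexed to the union of the
# two row indexes; labels missing on one side are filled with NaN.
left_aligned, right_aligned = left.align(right, join="outer", axis=0)
print(left_aligned.index.tolist())   # ['a', 'b', 'c']
print(right_aligned.index.tolist())  # ['a', 'b', 'c']
```

Note that align only reshapes; it never combines values, which makes it a common first step before element-wise arithmetic on objects with mismatched indexes.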
- all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)[source]
Return whether all elements are True, potentially over an axis.
Returns True unless there at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
- Parameters
- axis : {0 or 'index', 1 or 'columns', None}, default 0
Indicate which axis or axes should be reduced.
0 / 'index' : reduce the index, return a Series whose index is the original column labels.
1 / 'columns' : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- bool_only : bool, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
- skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
- level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- **kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
If level is specified, then a DataFrame is returned; otherwise, a Series is returned.
See also
Series.all : Return True if all elements are True.
DataFrame.any : Return True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False
Default behaviour checks if column-wise values all return True.
>>> df.all()
col1     True
col2    False
dtype: bool
Specify axis='columns' to check if row-wise values all return True.
>>> df.all(axis='columns')
0     True
1    False
dtype: bool
Or axis=None for whether every value is True.
>>> df.all(axis=None)
False
- any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)[source]
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a Series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’, None}, default 0
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
- bool_onlybool, default None
- skipnabool, default True
- levelint or level name, default None
- **kwargsany, default None
Indicate which axis or axes should be reduced.
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
If level is specified, then a DataFrame is returned; otherwise, a Series is returned.
See also
numpy.any : Numpy version of this method.
Series.any : Return whether any element is True.
Series.all : Return whether all elements are True.
DataFrame.any : Return whether any element is True over requested axis.
DataFrame.all : Return whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0

>>> df.any()
A     True
B     True
C    False
dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2

>>> df.any(axis='columns')
0    True
1    True
dtype: bool

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0

>>> df.any(axis='columns')
0     True
1    False
dtype: bool
Aggregating over the entire DataFrame with axis=None.
>>> df.any(axis=None)
True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
- append(other, ignore_index=False, verify_integrity=False, sort=False)[source]
Append rows of other to the end of caller, returning a new object.
Columns in other that are not in the caller are added as new columns.
- Parameters
- other : DataFrame or Series/dict-like object, or list of these
The data to append.
- ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
- verify_integrity : bool, default False
If True, raise ValueError on creating index with duplicates.
- sort : bool, default False
Sort columns if the columns of self and other are not aligned.
Changed in version 1.0.0: Changed to not sort by default.
- Returns
- DataFrame
A new DataFrame consisting of the rows of caller and the rows of other.
See also
concat : General function to concatenate DataFrame or Series objects.
Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
Examples
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])
>>> df
   A  B
x  1  2
y  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])
>>> df.append(df2)
   A  B
x  1  2
y  3  4
x  5  6
y  7  8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
- apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)[source]
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
- Parameters
- func : function
Function to apply to each column or row.
- axis : {0 or 'index', 1 or 'columns'}, default 0
Axis along which the function is applied:
0 or 'index': apply function to each column.
1 or 'columns': apply function to each row.
- raw : bool, default False
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the function.
True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
- result_type : {'expand', 'reduce', 'broadcast', None}, default None
These only act when axis=1 (columns):
'expand' : list-like results will be turned into columns.
'reduce' : returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.
'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
- args : tuple
Positional arguments to pass to func in addition to the array/series.
- **kwargs
Additional keyword arguments to pass as keyword arguments to func.
- Returns
- Series or DataFrame
Result of applying func along the given axis of the DataFrame.
See also
DataFrame.applymap : For elementwise operations.
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.transform : Only perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
Using a numpy universal function (in this case the same as np.sqrt(df)):
>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0
Using a reducing function on either axis
>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64

>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64
Returning a list-like will result in a Series
>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
Passing result_type='expand' will expand list-like results to columns of a DataFrame.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand')
   0  1
0  1  2
1  1  2
2  1  2
Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.
>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
   foo  bar
0    1    2
1    1    2
2    1    2
Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar is returned by the function, and broadcast it along the axis. The resulting column names will be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
- applymap(func, na_action=None, **kwargs)[source]
Apply a function to a DataFrame elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
- Parameters
- func : callable
Python function, returns a single value from a single value.
- na_action : {None, 'ignore'}, default None
If 'ignore', propagate NaN values, without passing them to func.
New in version 1.2.
- **kwargs
Additional keyword arguments to pass as keyword arguments to func.
New in version 1.3.0.
- Returns
- DataFrame
Transformed DataFrame.
See also
DataFrame.apply : Apply a function along input axis of DataFrame.
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567

>>> df.applymap(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5
Like Series.map, NA values can be ignored:
>>> df_copy = df.copy()
>>> df_copy.iloc[0, 0] = pd.NA
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')
      0  1
0  <NA>  4
1     5  5
Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
But it’s better to avoid applymap in that case.
>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
- asfreq(freq, method=None, how=None, normalize=False, fill_value=None)[source]
Convert time series to specified frequency.
Returns the original data conformed to a new index with the specified frequency.
If the index of this DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.
- Parameters
- freq : DateOffset or str
Frequency DateOffset or string.
- method : {'backfill'/'bfill', 'pad'/'ffill'}, default None
Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
'pad' / 'ffill': propagate last valid observation forward to next valid.
'backfill' / 'bfill': use NEXT valid observation to fill.
- how : {'start', 'end'}, default 'end'
For PeriodIndex only (see PeriodIndex.asfreq).
- normalize : bool, default False
Whether to reset output index to midnight.
- fill_value : scalar, optional
Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
- Returns
- DataFrame
DataFrame object reindexed to the specified frequency.
See also
reindex : Conform DataFrame to new index with optional filling logic.
Notes
To learn more about the frequency strings, please see this link.
Examples
Start by creating a series with 4 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=4, freq='T')
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({'s': series})
>>> df
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:01:00  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:03:00  3.0
Upsample the series into 30 second bins.
>>> df.asfreq(freq='30S')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  NaN
2000-01-01 00:03:00  3.0
Upsample again, providing a fill value.
>>> df.asfreq(freq='30S', fill_value=9.0)
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  9.0
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  9.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  9.0
2000-01-01 00:03:00  3.0
Upsample again, providing a method.
>>> df.asfreq(freq='30S', method='bfill')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  2.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  3.0
2000-01-01 00:03:00  3.0
- asof(where, subset=None)[source]
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series or a Series of NaN values for a DataFrame.
- Parameters
- where : date or array-like of dates
Date(s) before which the last row(s) are returned.
- subset : str or array-like of str, default None
For DataFrame, if not None, only use these columns to check for NaNs.
- Returns
- scalar, Series, or DataFrame
The return can be:
scalar : when self is a Series and where is a scalar
Series : when self is a Series and where is an array-like, or when self is a DataFrame and where is a scalar
DataFrame : when self is a DataFrame and where is an array-like
See also
merge_asof : Perform an asof merge. Similar to left join.
Notes
Dates are assumed to be sorted. Raises if this is not the case.
Examples
A Series and a scalar where.
>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10    1.0
20    2.0
30    NaN
40    4.0
dtype: float64
>>> s.asof(20)
2.0
For a sequence where, a Series is returned. The first value is NaN, because the first element of where is before the first index value.
>>> s.asof([5, 20])
5     NaN
20    2.0
dtype: float64
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the index location for 30.
>>> s.asof(30)
2.0
Take all columns into consideration
>>> df = pd.DataFrame({'a': [10, 20, 30, 40, 50],
...                    'b': [None, None, None, None, 500]},
...                   index=pd.DatetimeIndex(['2018-02-27 09:01:00',
...                                           '2018-02-27 09:02:00',
...                                           '2018-02-27 09:03:00',
...                                           '2018-02-27 09:04:00',
...                                           '2018-02-27 09:05:00']))
>>> df.asof(pd.DatetimeIndex(['2018-02-27 09:03:30',
...                           '2018-02-27 09:04:30']))
                      a   b
2018-02-27 09:03:30 NaN NaN
2018-02-27 09:04:30 NaN NaN
Take a single column into consideration
>>> df.asof(pd.DatetimeIndex(['2018-02-27 09:03:30',
...                           '2018-02-27 09:04:30']),
...         subset=['a'])
                        a   b
2018-02-27 09:03:30  30.0 NaN
2018-02-27 09:04:30  40.0 NaN
- assign(**kwargs)[source]
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters
- **kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas doesn't check it). If the values are not callable (e.g. a Series, scalar, or array), they are simply assigned.
- Returns
- DataFrame
A new DataFrame with the new columns in addition to all the existing columns.
Notes
Assigning multiple columns within the same assign is possible. Later items in **kwargs may refer to newly created or modified columns in df; items are computed and assigned into df in order.
Examples
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
- astype(dtype, copy=True, errors='raise')[source]
Cast a pandas object to a specified dtype dtype.
- Parameters
- dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame's columns to column-specific types.
- copy : bool, default True
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
- errors : {'raise', 'ignore'}, default 'raise'
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised.
ignore : suppress exceptions. On error return original object.
- Returns
- casted : same type as caller
See also
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to a numeric type.
numpy.ndarray.astype : Cast a numpy array to a specified type.
Notes
Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use Series.dt.tz_localize() instead.
Examples
Create a DataFrame:
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object
Cast all columns to int32:
>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object
Cast col1 to int32 using a dictionary:
>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object
Create a series:
>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64
Create a series of dates:
>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
- property at: pandas.core.indexing._AtIndexer
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.
- Raises
- KeyError
If ‘label’ does not exist in DataFrame.
See also
DataFrame.iat : Access a single value for a row/column pair by integer position.
DataFrame.loc : Access a group of rows and columns by label(s).
Series.at : Access a single value using a label.
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df
    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30
Get value at specified row/column pair
>>> df.at[4, 'B']
2
Set value at specified row/column pair
>>> df.at[4, 'B'] = 10
>>> df.at[4, 'B']
10
Get value within a Series
>>> df.loc[5].at['B']
4
- at_time(time, asof=False, axis=None)[source]
Select values at particular time of day (e.g., 9:30AM).
- Parameters
- timedatetime.time or str
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
- Returns
- Series or DataFrame
- Raises
- TypeError
If the index is not a DatetimeIndex.
See also
between_time : Select values between particular times of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time : Get just the index locations for values at particular time of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4
>>> ts.at_time('12:00')
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
- property attrs: dict[Hashable, Any]
Dictionary of global attributes of this dataset.
Warning: attrs is experimental and may change without warning.
See also
DataFrame.flags : Global flags applying to this object.
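No example is given for attrs in this extract; a minimal sketch (the 'source' key is a made-up metadata name, and, per the warning above, propagation of attrs through operations is experimental):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
print(df.attrs)                   # starts out as an empty dict: {}
df.attrs['source'] = 'sensor-1'   # arbitrary, user-defined metadata
print(df.attrs)                   # {'source': 'sensor-1'}
```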
- property axes: list[Index]
Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.
Examples
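The Examples entry above is empty in this extract; a minimal sketch of what axes returns:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# df.axes == [row index, column index], in that order.
row_labels, col_labels = df.axes
print(row_labels)   # RangeIndex(start=0, stop=2, step=1)
print(col_labels)   # Index(['col1', 'col2'], dtype='object')
```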
- backfill(axis=None, inplace=False, limit=None, downcast=None)[source]
Synonym for DataFrame.fillna() with method='bfill'.
- Returns
- Series/DataFrame or None
Object with missing values filled or None if inplace=True.
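No example is given for this synonym; a minimal sketch of the fill it performs (shown via the equivalent bfill(), which also avoids the method name being deprecated in newer pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 2.0, np.nan, 4.0]})
# Each NaN is replaced by the next valid observation below it;
# a trailing NaN with no later valid value would stay NaN.
filled = df.bfill()  # same result as df.fillna(method='bfill')
print(filled['A'].tolist())  # [2.0, 2.0, 4.0, 4.0]
```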
- between_time(start_time, end_time, include_start=True, include_end=True, axis=None)[source]
Select values between particular times of the day (e.g., 9:00-9:30 AM).
By setting start_time to be later than end_time, you can get the times that are not between the two times.
- Parameters
- start_time : datetime.time or str
Initial time as a time filter limit.
- end_time : datetime.time or str
End time as a time filter limit.
- include_start : bool, default True
Whether the start time needs to be included in the result.
- include_end : bool, default True
Whether the end time needs to be included in the result.
- axis : {0 or 'index', 1 or 'columns'}, default 0
Determine the time range on the index (0) or columns (1).
- Returns
- Series or DataFrame
Data from the original object filtered to the specified dates range.
- Raises
- TypeError
If the index is not a DatetimeIndex.
See also
at_time : Select values at a particular time of the day.
DatetimeIndex.indexer_between_time : Get just the index locations for values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4
>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
You get the times that are not between two times by setting start_time later than end_time:
>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
- bfill(axis=None, inplace=False, limit=None, downcast=None)[source]
Synonym for DataFrame.fillna() with method='bfill'.
- Returns
- Series/DataFrame or None
Object with missing values filled or None if inplace=True.
- bool()[source]
Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also raise an exception).
- Returns
- bool
The value in the Series or DataFrame.
See also
Series.astype : Change the data type of a Series, including to boolean.
DataFrame.astype : Change the data type of a DataFrame, including to boolean.
numpy.bool_ : NumPy boolean data type, used by pandas for boolean values.
Examples
The method will only work for single element objects with a boolean value:
>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False

>>> pd.DataFrame({'col': [True]}).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False
- boxplot(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, backend=None, **kwargs)[source]
Make a box plot from DataFrame columns.
Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns. A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.
For further details see Wikipedia's entry for boxplot.
- Parameters
- column : str or list of str, optional
Column name or list of names, or vector. Can be any valid input to pandas.DataFrame.groupby().
- by : str or array-like, optional
Column in the DataFrame to pandas.DataFrame.groupby(). One box-plot will be done per value of columns in by.
- ax : object of class matplotlib.axes.Axes, optional
The matplotlib axes to be used by boxplot.
- fontsize : float or str
Tick label font size in points or as a string (e.g., large).
- rot : int or float, default 0
The rotation angle of labels (in degrees) with respect to the screen coordinate system.
- grid : bool, default True
Setting this to True will show the grid.
- figsize : a tuple (width, height) in inches
The size of the figure to create in matplotlib.
- layout : tuple (rows, columns), optional
For example, (3, 5) will display the subplots using 3 columns and 5 rows, starting from the top-left.
- return_type : {'axes', 'dict', 'both'} or None, default 'axes'
The kind of object to return. The default is axes.
'axes' returns the matplotlib axes the boxplot is drawn on.
'dict' returns a dictionary whose values are the matplotlib Lines of the boxplot.
'both' returns a namedtuple with the axes and dict.
When grouping with by, a Series mapping columns to return_type is returned.
If return_type is None, a NumPy array of axes with the same shape as layout is returned.
- backend : str, default None
Backend to use instead of the backend specified in the option plotting.backend. For instance, 'matplotlib'. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
- **kwargs
All other plotting keyword arguments to be passed to matplotlib.pyplot.boxplot().
- Returns
- result
See Notes.
See also
Series.plot.hist : Make a histogram.
matplotlib.pyplot.boxplot : Matplotlib equivalent plot.
Notes
The return type depends on the return_type parameter:
'axes' : object of class matplotlib.axes.Axes
'dict' : dict of matplotlib.lines.Line2D objects
'both' : a namedtuple with structure (ax, lines)
For data grouped with by, return a Series of the above or a numpy array:
Series
array (for return_type = None)
Use return_type='dict' when you want to tweak the appearance of the lines after plotting. In this case a dict containing the Lines making up the boxes, caps, fliers, medians, and whiskers is returned.
Examples
Boxplots can be created for every column in the dataframe by df.boxplot() or indicating the columns to be used.
Boxplots of variables distributions grouped by the values of a third variable can be created using the option by.
A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by combination of the variables in the x-axis.
The layout of boxplot can be adjusted giving a tuple to layout.
Additional formatting can be done to the boxplot, like suppressing the grid (grid=False), rotating the labels in the x-axis (i.e. rot=45) or changing the fontsize (i.e. fontsize=15).
The parameter return_type can be used to select the type of element returned by boxplot. When return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], return_type='axes')
>>> type(boxplot)
<class 'matplotlib.axes._subplots.AxesSubplot'>
When grouping with by, a Series mapping columns to return_type is returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X',
...                      return_type='axes')
>>> type(boxplot)
<class 'pandas.core.series.Series'>
If return_type is None, a NumPy array of axes with the same shape as layout is returned:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X',
...                      return_type=None)
>>> type(boxplot)
<class 'numpy.ndarray'>
- clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs)[source]
Trim values at input threshold(s).
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
- Parameters
- lower : float or array-like, default None
Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
- upper : float or array-like, default None
Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.
- axis : int or str axis name, optional
Align object with lower and upper along the given axis.
- inplace : bool, default False
Whether to perform the operation in place on the data.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.
- Returns
- Series or DataFrame or None
Same type as calling object with the values outside the clip boundaries replaced, or None if inplace=True.
See also
Series.clip : Trim values at input threshold in series.
DataFrame.clip : Trim values at input threshold in dataframe.
numpy.clip : Clip (limit) the values in an array.
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64

>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3
Clips using specific lower threshold per column element, with missing values:
>>> t = pd.Series([2, -4, np.NaN, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64

>>> df.clip(t, axis=0)
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
- columns: Index
The column labels of the DataFrame.
- combine(other, func, fill_value=None, overwrite=True)[source]
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters
- other : DataFrame
The DataFrame to merge column-wise.
- func : function
Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.
- fill_value : scalar value, default None
The value to fill NaNs with prior to passing any column to the merge func.
- overwrite : bool, default True
If True, columns in self that do not exist in other will be overwritten with NaNs.
- Returns
- DataFrame
Combination of the provided DataFrames.
See alsoDataFrame.combine_first
Combine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
>>> df1.combine(df2, take_smaller)
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3

Using fill_value fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None is preserved:

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of overwrite and behavior when the axes differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0

>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN

>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
- combine_first(other)[source]
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters
- other : DataFrame
Provided DataFrame to use to fill null values.
- Returns
- DataFrame
The result of combining the provided DataFrame with the other object.
See also
DataFrame.combine
Perform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in other:

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
- compare(other, align_axis=1, keep_shape=False, keep_equal=False)[source]
Compare to another DataFrame and show the differences.
New in version 1.1.0.
- Parameters
- other : DataFrame
Object to compare with.
- align_axis : {0 or ‘index’, 1 or ‘columns’}, default 1
Determine which axis to align the comparison on.
- 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.
- 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.
- keep_shape : bool, default False
If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
- keep_equal : bool, default False
If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
- Returns
- DataFrame
DataFrame that shows the differences stacked side by side.
The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
- Raises
- ValueError
When the two DataFrames don’t have identical labels or shape.
See also
Series.compare
Compare with another Series and show differences.
DataFrame.equals
Test whether two objects contain the same elements.
Notes
Matching NaNs will not appear as a difference.
Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames.
Examples
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Align the differences on columns:

>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Stack the differences on rows:

>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

Keep the equal values:

>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

Keep all original rows and columns:

>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all original rows and columns and also all original values:

>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
- convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)[source]
Convert columns to best possible dtypes using dtypes supporting pd.NA.
New in version 1.0.0.
- Parameters
- infer_objects : bool, default True
Whether object dtypes should be converted to the best possible types.
- convert_string : bool, default True
Whether object dtypes should be converted to StringDtype().
- convert_integer : bool, default True
Whether, if possible, conversion can be done to integer extension types.
- convert_boolean : bool, default True
Whether object dtypes should be converted to BooleanDtypes().
- convert_floating : bool, default True
Whether, if possible, conversion can be done to floating extension types. If convert_integer is also True, preference will be given to integer dtypes if the floats can be faithfully cast to integers.
New in version 1.2.0.
- Returns
- Series or DataFrame
Copy of input object with new dtype.
See also
infer_objects
Infer dtypes of objects.
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to a numeric type.
Notes
By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, convert_boolean and convert_floating, it is possible to turn off individual conversions to StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference rules as during normal Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns to the nullable floating extension type.
In the future, as new dtypes are added that support pd.NA, the results of this method will change to support those new dtypes.
Examples
>>> df = pd.DataFrame(
...     {
...         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...     }
... )

Start with a DataFrame with default dtypes.

>>> df
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0

>>> df.dtypes
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object

Convert the DataFrame to use best possible dtypes.

>>> dfn = df.convert_dtypes()
>>> dfn
   a  b      c     d     e      f
0  1  x   True     h    10   <NA>
1  2  y  False     i  <NA>  100.5
2  3  z   <NA>  <NA>    20  200.0

>>> dfn.dtypes
a      Int32
b     string
c    boolean
d     string
e      Int64
f    Float64
dtype: object

Start with a Series of strings and missing data represented by np.nan.

>>> s = pd.Series(["a", "b", np.nan])
>>> s
0      a
1      b
2    NaN
dtype: object

Obtain a Series with dtype StringDtype.

>>> s.convert_dtypes()
0       a
1       b
2    <NA>
dtype: string
- copy(deep=True)[source]
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
- Parameters
- deep : bool, default True
Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
- Returns
- copy : Series or DataFrame
Object type matches caller.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
Examples
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64

>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by shallow copy and original are reflected in both; deep copy remains unchanged.

>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
- corr(method='pearson', min_periods=1)[source]
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters
- method : {‘pearson’, ‘kendall’, ‘spearman’} or callable
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable : callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
- min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
- Returns
- DataFrame
Correlation matrix.
See also
DataFrame.corrwith
Compute pairwise correlation with another DataFrame or Series.
Series.corr
Compute the correlation between two Series.
Examples
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
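The example above uses a custom callable; the built-in methods are selected by name. A minimal sketch (the values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'dogs': [.2, .0, .6, .2],
                   'cats': [.3, .6, .0, .1]})

# Rank-based (Spearman) correlation; the diagonal is always 1.0.
spearman = df.corr(method='spearman')
```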
- corrwith(other, axis=0, drop=False, method='pearson')[source]
Compute pairwise correlation.
Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
- Parameters
- other : DataFrame, Series
Object with which to compute correlations.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise.
- drop : bool, default False
Drop missing indices from result.
- method : {‘pearson’, ‘kendall’, ‘spearman’} or callable
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable : callable with input two 1d ndarrays and returning a float.
- Returns
- Series
Pairwise correlations.
See also
DataFrame.corr
Compute pairwise correlation of columns.
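A minimal worked sketch of column-wise correlation against a second frame (data invented for illustration; each column of df2 is an exact multiple of the matching column of df1, so every correlation comes out 1.0):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=['a', 'b', 'c', 'd'])
df2 = df1 * 2  # perfectly linearly correlated, column by column

# Column-wise Pearson correlation between the two frames.
result = df1.corrwith(df2)
```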
- count(axis=0, level=None, numeric_only=False)[source]
Count non-NA cells for each column or row.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
- Parameters
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’, counts are generated for each column. If 1 or ‘columns’, counts are generated for each row.
- level : int or str, optional
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.
- numeric_only : bool, default False
Include only float, int or boolean data.
- Returns
- Series or DataFrame
For each column/row the number of non-NA/null entries. If level is specified, returns a DataFrame.
See also
Series.count
Number of non-NA elements in a Series.
DataFrame.value_counts
Count unique combinations of columns.
DataFrame.shape
Number of DataFrame rows and columns (including NA elements).
DataFrame.isna
Boolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
  Person   Age  Single
0   John  24.0   False
1   Myla   NaN    True
2  Lewis  21.0    True
3   John  33.0    True
4   Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')
0    3
1    2
2    3
3    3
4    3
dtype: int64
- cov(min_periods=None, ddof=1)[source]
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
- Parameters
- min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result.
- ddof : int, default 1
Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
New in version 1.1.0.
- Returns
- DataFrame
The covariance matrix of the series of the DataFrame.
See also
Series.cov
Compute covariance with another Series.
core.window.ExponentialMovingWindow.cov
Exponential weighted sample covariance.
core.window.Expanding.cov
Expanding sample covariance.
core.window.Rolling.cov
Rolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], ... columns=['dogs', 'cats']) >>> df.cov() dogs cats dogs 0.666667 -1.000000 cats -1.000000 1.666667
>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(1000, 5), ... columns=['a', 'b', 'c', 'd', 'e']) >>> df.cov() a b c d e a 0.998438 -0.020161 0.059277 -0.008943 0.014144 b -0.020161 1.059352 -0.008543 -0.024738 0.009826 c 0.059277 -0.008543 1.010670 -0.001486 -0.000271 d -0.008943 -0.024738 -0.001486 0.921297 -0.013692 e 0.014144 0.009826 -0.000271 -0.013692 0.977795
Minimum number of periods
This method also supports an optional
min_periods
keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(20, 3), ... columns=['a', 'b', 'c']) >>> df.loc[df.index[:5], 'a'] = np.nan >>> df.loc[df.index[5:10], 'b'] = np.nan >>> df.cov(min_periods=12) a b c a 0.316741 NaN -0.150812 b NaN 1.248003 0.191417 c -0.150812 0.191417 0.895202
- cummax(axis=None, skipna=True, *args, **kwargs)[source]
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
Return cumulative maximum of Series or DataFrame.
See also
core.window.Expanding.max
Similar functionality but ignores NaN values.
DataFrame.max
Return the maximum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cummax(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1:

>>> df.cummax(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
- cummin(axis=None, skipna=True, *args, **kwargs)[source]
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
Return cumulative minimum of Series or DataFrame.
See also
core.window.Expanding.min
Similar functionality but ignores NaN values.
DataFrame.min
Return the minimum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cummin(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1:

>>> df.cummin(axis=1)
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
- cumprod(axis=None, skipna=True, *args, **kwargs)[source]
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
Return cumulative product of Series or DataFrame.
See also
core.window.Expanding.prod
Similar functionality but ignores NaN values.
DataFrame.prod
Return the product over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cumprod(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1:

>>> df.cumprod(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
- cumsum(axis=None, skipna=True, *args, **kwargs)[source]
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
- skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- *args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
- Series or DataFrame
Return cumulative sum of Series or DataFrame.
See also
core.window.Expanding.sum
Similar functionality but ignores NaN values.
DataFrame.sum
Return the sum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cumsum(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1:

>>> df.cumsum(axis=1)
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
- describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)[source]
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
- Parameters
- percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
- include : ‘all’, list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for Series. Here are the options:
‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.
- exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.
- datetime_is_numeric : bool, default False
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
- Returns
- Series or DataFrame
Summary statistics of the Series or Dataframe provided.
See also
DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.min
Minimum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.std
Standard deviation of the observations.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                    })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
- diff(periods=1, axis=0)[source]
First discrete difference of element.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
- Parameters
- periods : int, default 1
Periods to shift for calculating difference, accepts negative values.
- axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Take difference over rows (0) or columns (1).
- Returns
- DataFrame
First differences of the DataFrame.
See also
DataFrame.pct_change
Percent change over given number of periods.
DataFrame.shift
Shift index by desired number of periods with an optional time freq.
Series.diff
First discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36

>>> df.diff()
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28

Difference with 3rd previous row

>>> df.diff(periods=3)
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN

Overflow in input dtype

>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)
>>> df.diff()
       a
0    NaN
1  255.0
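The boolean-XOR behavior mentioned in the Notes can be sketched as follows (a minimal example; the column name `flag` is purely illustrative):

```python
import pandas as pd

# For boolean dtypes, diff() applies operator.xor() rather than
# subtraction, so each row flags whether the value changed relative
# to the previous row (the first row is NaN, as usual).
df = pd.DataFrame({"flag": [True, True, False, False, True]})
changed = df.diff()
print(changed)
```

This makes diff() a convenient change detector on boolean columns.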
- div(other, axis='columns', level=None, fill_value=None)[source]
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
- otherscalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- levelint or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
- DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
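The fill_value semantics described above can be sketched in isolation (a minimal example; the column name `a` is purely illustrative):

```python
import numpy as np
import pandas as pd

# fill_value substitutes for a value that is missing on one side of
# the operation, but positions missing in *both* inputs stay NaN.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan]})
other = pd.DataFrame({"a": [2.0, np.nan, np.nan]})
result = df.div(other, fill_value=1.0)
print(result)
```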
- divide(other, axis='columns', level=None, fill_value=None)[source]
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
- otherscalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
- levelint or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
- DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- dot(other)[source]
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
- Parameters
- otherSeries, DataFrame or array-like
The other object to compute the matrix product with.
- Returns
- Series or DataFrame
If other is a Series, return the matrix product between self and other as a Series. If other is a DataFrame or a numpy.array, return the matrix product of self and other in a DataFrame or a np.array.
See also
Series.dot
Similar method for Series.
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
Examples
Here we multiply a DataFrame with a Series.

>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64

Here we multiply a DataFrame with another DataFrame.

>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
   0  1
0  1  4
1  2  2

Note that the dot method gives the same result as @

>>> df @ other
   0  1
0  1  4
1  2  2

The dot method also works if other is a np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
   0  1
0  1  4
1  2  2

Note how shuffling of the objects does not change the result.

>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
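The alignment step described in the Notes can be sketched directly with labeled axes (a minimal example; the labels `x` and `y` are purely illustrative):

```python
import pandas as pd

# dot() aligns the DataFrame's columns with the Series' index before
# multiplying, so a shuffled index gives the same answer as the
# natural order.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "y"])
s = pd.Series([10, 1], index=["y", "x"])   # order differs from df's columns
out = df.dot(s)                            # row i = 1*x_i + 10*y_i
print(out)
```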
- drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')[source]
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.
- Parameters
- labelssingle label or list-like
Index or column labels to drop.
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
- indexsingle label or list-like
Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
- columnssingle label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
- levelint or level name, optional
For MultiIndex, level from which the labels will be removed.
- inplacebool, default False
If False, return a copy. Otherwise, do operation inplace and return None.
- errors{‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and only existing labels are dropped.
- Returns
- DataFrame or None
DataFrame without the removed index or column labels or None if inplace=True.
- Raises
- KeyError
If any of the labels is not found in the selected axis.
See also
DataFrame.loc
Label-location based indexer for selection by label.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates
Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop
Return Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed   320.0  250.0
       weight    1.0    0.8
       length    0.3    0.2

>>> df.drop(index='cow', columns='small')
                 big
lama   speed    45.0
       weight  200.0
       length    1.5
falcon speed   320.0
       weight    1.0
       length    0.3

>>> df.drop(index='length', level=1)
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed   320.0  250.0
       weight    1.0    0.8
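The errors parameter described above can be sketched as follows (a minimal example; the column names are purely illustrative):

```python
import pandas as pd

# With errors='ignore', labels that are absent from the axis are
# silently skipped instead of raising KeyError; only the labels that
# exist are dropped.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
out = df.drop(columns=["B", "Z"], errors="ignore")
print(out.columns.tolist())
```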
- drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)[source]
Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes are ignored.
- Parameters
- subsetcolumn label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep.
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
- inplacebool, default False
Whether to drop duplicates in place or to return a copy.
- ignore_indexbool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
- Returns
- DataFrame or None
DataFrame with duplicates removed or None if inplace=True.
See also
DataFrame.value_counts
Count unique combinations of columns.
Examples
Consider dataset containing ramen rating.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
- droplevel(level, axis=0)[source]
Return Series/DataFrame with requested index / column level(s) removed.
- Parameters
- levelint, str, or list-like
If a string is given, must be the name of a level. If list-like, elements must be names or positional indexes of levels.
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the level(s) is removed:
0 or ‘index’: remove level(s) from the row index.
1 or ‘columns’: remove level(s) from the columns.
- Returns
- Series/DataFrame
Series/DataFrame with requested index / column level(s) removed.
Examples
>>> df = pd.DataFrame([
...     [1, 2, 3, 4],
...     [5, 6, 7, 8],
...     [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])

>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])

>>> df
level_1   c   d
level_2   e   f
a b
1 2      3   4
5 6      7   8
9 10    11  12

>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2        3   4
6        7   8
10      11  12

>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2      3   4
5 6      7   8
9 10    11  12
- dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)[source]
Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
- how{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
- threshint, optional
Require that many non-NA values.
- subsetarray-like, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
- inplacebool, default False
If True, do operation inplace and return None.
- Returns
- DataFrame or None
DataFrame with NA entries dropped from it or None if inplace=True.
See also
DataFrame.isna
Indicate missing values.
DataFrame.notna
Indicate existing (non-missing) values.
DataFrame.fillna
Replace missing values.
Series.dropna
Drop missing values.
Index.dropna
Drop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25
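Combining thresh with axis is also possible, as in this minimal sketch (the column names are purely illustrative):

```python
import numpy as np
import pandas as pd

# Drop every *column* that has fewer than two non-NA values:
# 'full' has 3 non-NA values and survives, 'sparse' has only 1.
df = pd.DataFrame({"full": [1.0, 2.0, 3.0],
                   "sparse": [np.nan, np.nan, 3.0]})
out = df.dropna(axis="columns", thresh=2)
print(out.columns.tolist())
```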
- property dtypes
Return the dtypes in the DataFrame.
This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.
- Returns
- pandas.Series
The data type of each column.
Examples
>>> df = pd.DataFrame({'float': [1.0],
...                    'int': [1],
...                    'datetime': [pd.Timestamp('20180310')],
...                    'string': ['foo']})
>>> df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
- duplicated(subset=None, keep='first')[source]
Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
- Parameters
- subsetcolumn label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
- keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to mark.
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
- Returns
- Series
Boolean series denoting duplicated rows.
See also
Index.duplicated
Equivalent method on index.
Series.duplicated
Equivalent method on Series.
Series.drop_duplicates
Remove duplicate values from Series.
DataFrame.drop_duplicates
Remove duplicate values from DataFrame.
Examples
Consider dataset containing ramen rating.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, for each set of duplicated values, the first occurrence is set on False and all others on True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting keep on False, all duplicates are True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use subset.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
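Because the result is a boolean Series, it composes naturally with summing and boolean indexing, as in this minimal sketch (the column name `a` is purely illustrative):

```python
import pandas as pd

# Summing the boolean mask counts the repeated rows, and negating it
# filters them out (keeping first occurrences), mirroring
# drop_duplicates().
df = pd.DataFrame({"a": [1, 1, 2, 2, 2]})
n_repeats = int(df.duplicated().sum())
firsts = df[~df.duplicated()]
print(n_repeats, list(firsts["a"]))
```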
- property empty: bool
Indicator whether DataFrame is empty.
True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
- Returns
- bool
If DataFrame is empty, return True, if not return False.
See also
Series.dropna
Return series without null values.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data are missing.
Notes
If DataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
An example of an actual empty DataFrame. Notice the index is empty:

>>> df_empty = pd.DataFrame({'A' : []})
>>> df_empty
Empty DataFrame
Columns: [A]
Index: []
>>> df_empty.empty
True

If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the DataFrame empty:

>>> df = pd.DataFrame({'A' : [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True
- eq(other, axis='columns', level=None)[source]
Get Equal to of dataframe and other, element-wise (binary operator eq).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
- otherscalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- levelint or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
- DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False

>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150

>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225

>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- equals(other)[source]
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
The row/column index does not need to have the same type, as long as the values are considered equal. Corresponding columns must be of the same dtype.
- Parameters
- otherSeries or DataFrame
The other Series or DataFrame to be compared with the first.
- Returns
- bool
True if all elements are the same in both objects, False otherwise.
See also
Series.eq
Compare two Series objects of the same length and return a Series where each element is True if the element in each Series is equal, False otherwise.
DataFrame.eq
Compare two DataFrame objects of the same shape and return a DataFrame where each element is True if the respective element in each DataFrame is equal, False otherwise.
testing.assert_series_equal
Raises an AssertionError if left and right are not equal. Provides an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal
Like assert_series_equal, but targets DataFrames.
numpy.array_equal
Return True if two arrays have the same shape and elements, False otherwise.
Examples
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20

DataFrames df and exactly_equal have the same types and values for their elements and column labels, which will return True.

>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

DataFrames df and different_column_type have the same element types and values, but have different types for the column labels, which will still return True.

>>> different_column_type = pd.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True

DataFrames df and different_data_type have different types for the same values for their elements, and will return False even though their column labels are the same values and types.

>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> different_data_type
      1     2
0  10.0  20.0
>>> df.equals(different_data_type)
False
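The NaN rule above is what distinguishes equals() from the element-wise eq(), as this minimal sketch shows (the column name `a` is purely illustrative):

```python
import numpy as np
import pandas as pd

# equals() treats NaNs in the same location as equal, whereas the
# element-wise eq() follows NaN != NaN and reports False there.
df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, np.nan]})
elementwise = df1.eq(df2)
whole = df1.equals(df2)
print(elementwise, whole)
```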
- eval(expr, inplace=False, **kwargs)[source]
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
- Parameters
- exprstr
The expression string to evaluate.
- inplacebool, default False
If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
- **kwargs
See the documentation for eval() for complete details on the keyword arguments accepted by query().
- Returns
- ndarray, scalar, pandas object, or None
The result of the evaluation or None if inplace=True.
See also
DataFrame.query
Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign
Can evaluate an expression or function to create new values for a column.
eval
Evaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.
Examples
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use inplace=True to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)
>>> df
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3
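Expressions can also reference local Python variables with the `@` prefix, as in this minimal sketch (the variable name `threshold` is purely illustrative):

```python
import pandas as pd

# '@name' inside the expression resolves to the local variable of
# that name; by default the original DataFrame is left unmodified.
df = pd.DataFrame({"A": [1, 2, 3]})
threshold = 2
out = df.eval("B = A > @threshold")
print(out)
```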
- ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None)[source]
Provide exponential weighted (EW) functions.
Available EW functions: mean(), var(), std(), corr(), cov().
Exactly one parameter: com, span, halflife, or alpha must be provided.
- Parameters
- comfloat, optional
Specify decay in terms of center of mass, \(\alpha = 1 / (1 + com)\), for \(com \geq 0\).
- spanfloat, optional
Specify decay in terms of span, \(\alpha = 2 / (span + 1)\), for \(span \geq 1\).
- halflifefloat, str, timedelta, optional
Specify decay in terms of half-life, \(\alpha = 1 - \exp\left(-\ln(2) / halflife\right)\), for \(halflife > 0\).
If times is specified, the time unit (str or timedelta) over which an observation decays to half its value. Only applicable to mean(), and the halflife value will not apply to the other functions.
New in version 1.1.0.
- alphafloat, optional
Specify smoothing factor \(\alpha\) directly, \(0 < \alpha \leq 1\).
- min_periodsint, default 0
Minimum number of observations in window required to have a value (otherwise result is NA).
- adjustbool, default True
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average).
When adjust=True (default), the EW function is calculated using weights \(w_i = (1 - \alpha)^i\). For example, the EW moving average of the series [\(x_0, x_1, ..., x_t\)] would be:
\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]
When adjust=False, the exponentially weighted function is calculated recursively:
\[\begin{split}y_0 &= x_0 \\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t\end{split}\]
- ignore_nabool, default False
Ignore missing values when calculating weights; specify True to reproduce pre-0.15.0 behavior.
When ignore_na=False (default), weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \((1-\alpha)^2\) and \(1\) if adjust=True, and \((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_na=True (reproducing pre-0.15.0 behavior), weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
- axis{0, 1}, default 0
The axis to use. The value 0 identifies the rows, and 1 identifies the columns.
- timesstr, np.ndarray, Series, default None
New in version 1.1.0.
Times corresponding to the observations. Must be monotonically increasing and datetime64[ns] dtype.
If str, the name of the column in the DataFrame representing the times.
If 1-D array like, a sequence with the same shape as the observations.
Only applicable to mean().
.- Returns
- DataFrame
A Window sub-classed for the particular operation.
Notes
More details can be found at: Exponentially weighted windows.
Examples
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0

>>> df.ewm(com=0.5).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213

Specifying times with a timedelta halflife when computing mean.

>>> times = ['2020-01-01', '2020-01-03', '2020-01-10', '2020-01-15', '2020-01-17']
>>> df.ewm(halflife='4 days', times=pd.DatetimeIndex(times)).mean()
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686
- expanding(min_periods=1, center=None, axis=0, method='single')[source]
Provide expanding transformations.
- Parameters
- min_periodsint, default 1
- centerbool, default False
- axisint or str, default 0
- methodstr {‘single’, ‘table’}, default ‘single’
Minimum number of observations in window required to have a value (otherwise result is NA).
Set the labels at the center of the window.
Execute the rolling operation per single column or row ('single') or over the entire object ('table'). This argument is only implemented when specifying engine='numba' in the method call.
New in version 1.3.0.
- Returns
- a Window sub-classed for the particular operation
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
Examples
>>> df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.expanding(2).sum()
     B
0  NaN
1  1.0
2  3.0
3  3.0
4  7.0
- explode(column, ignore_index=False)[source]
Transform each element of a list-like to a row, replicating index values.
New in version 0.25.0.
- Parameters
- columnIndexLabel
- ignore_indexbool, default False
Column(s) to explode. For multiple columns, specify a non-empty list in which each element is a str or tuple, and the specified columns' list-like data must have matching lengths on each row of the frame.
New in version 1.3.0: Multi-column explode.
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.1.0.
- Returns
- DataFrame
Exploded lists to rows of the subset columns; index will be duplicated for these rows.
- Raises
- ValueError
If columns of the frame are not unique.
If the specified columns to explode are an empty list.
If the specified columns to explode do not have a matching count of elements row-wise in the frame.
See also
DataFrame.unstack
Pivot a level of the (necessarily hierarchical) index labels.
DataFrame.melt
Unpivot a DataFrame from wide format to long format.
Series.explode
Explode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
Examples
>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
...                    'B': 1,
...                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
>>> df
           A  B          C
0  [0, 1, 2]  1  [a, b, c]
1        foo  1        NaN
2         []  1         []
3     [3, 4]  1     [d, e]
Single-column explode.
>>> df.explode('A')
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]
Multi-column explode.
>>> df.explode(list('AC'))
     A  B    C
0    0  1    a
0    1  1    b
0    2  1    c
1  foo  1  NaN
2  NaN  1  NaN
3    3  1    d
3    4  1    e
- ffill(axis=None, inplace=False, limit=None, downcast=None)[source]
Synonym for DataFrame.fillna() with method='ffill'.
- Returns
- Series/DataFrame or None
Object with missing values filled or None if inplace=True.
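A minimal runnable sketch of this equivalence (the data and column name are illustrative, not from the docs above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0]})

# ffill() propagates the last valid observation forward,
# exactly like fillna(method='ffill').
assert df.ffill().equals(df.fillna(method="ffill"))
print(df.ffill()["A"].tolist())  # [1.0, 1.0, 1.0, 4.0]
```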
- fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)[source]
Fill NA/NaN values using the specified method.
- Parameters
- valuescalar, dict, Series, or DataFrame
- method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
- axis{0 or ‘index’, 1 or ‘columns’}
- inplacebool, default False
- limitint, default None
- downcastdict, default is None
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use next valid observation to fill gap.
Axis along which to fill missing values.
If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).
- Returns
- DataFrame or None
Object with missing values filled or None if inplace=True.
See also
interpolate
Fill NaN values using interpolation.
reindex
Conform object to new index.
asfreq
Convert TimeSeries to specified frequency.
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward.
>>> df.fillna(method="ffill")
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
When filling using a DataFrame, replacement happens along the same column names and same indices.
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
- filter(items=None, like=None, regex=None, axis=None)[source]
Subset the dataframe rows or columns according to the specified index labels.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
- Parameters
- itemslist-like
- likestr
- regexstr (regular expression)
- axis{0 or ‘index’, 1 or ‘columns’, None}, default None
Keep labels from axis which are in items.
Keep labels from axis for which “like in label == True”.
Keep labels from axis for which re.search(regex, label) == True.
The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for Series, ‘columns’ for DataFrame.
- Returns
- same type as input object
See also
DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
Notes
The items, like, and regex parameters are enforced to be mutually exclusive. axis defaults to the info axis that is used when indexing with [].
Examples
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
        one  three
mouse     1      3
rabbit    4      6
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
        one  three
mouse     1      3
rabbit    4      6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6
- first(offset)[source]
Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date offset.
- Parameters
- offsetstr, DateOffset or dateutil.relativedelta
The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
- Returns
- Series or DataFrame
A subset of the caller.
- Raises
- TypeError
If the index is not a DatetimeIndex.
See also
last
Select final periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the first 3 days:
>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2
Notice that data for the first 3 calendar days was returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
- first_valid_index()[source]
Return index for first non-NA value or None, if no non-NA value is found.
- Returns
- scalartype of index
Notes
If all elements are NA/null, returns None. Also returns None for an empty Series/DataFrame.
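A short sketch of the behavior described above (the data values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, np.nan, 2.0, 3.0]})
print(df.first_valid_index())  # 2 -- first row holding a non-NA value

# All-NA and empty frames yield None.
print(pd.DataFrame({"A": [np.nan]}).first_valid_index())  # None
print(pd.DataFrame().first_valid_index())                 # None
```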
- property flags: pandas.core.flags.Flags
Get the properties associated with this pandas object.
The available flags are
Flags.allows_duplicate_labels
See also
Flags
Flags that apply to pandas objects.
DataFrame.attrs
Global metadata applying to this dataset.
Notes
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame). Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags
<Flags(allows_duplicate_labels=True)>
Flags can be get or set using attribute access:
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
Or by slicing with a key:
>>> df.flags["allows_duplicate_labels"]
False
>>> df.flags["allows_duplicate_labels"] = True
- floordiv(other, axis='columns', level=None, fill_value=None)[source]
Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
- otherscalar, sequence, Series, or DataFrame
- axis{0 or ‘index’, 1 or ‘columns’}
- levelint or label
- fill_valuefloat or None, default None
Any single or multiple element data structure, or list-like object.
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
Broadcast across a level, matching Index values on the passed MultiIndex level.
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
- DataFrame
Result of the arithmetic operation.
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- classmethod from_dict(data, orient='columns', dtype=None, columns=None)[source]
Construct DataFrame from dict of array-like or dicts.
Creates DataFrame object from dictionary by columns or by index allowing dtype specification.
- Parameters
- datadict
- orient{‘columns’, ‘index’}, default ‘columns’
- dtypedtype, default None
- columnslist, default None
Of the form {field : array-like} or {field : dict}.
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
Data type to force, otherwise infer.
Column labels to use when orient='index'. Raises a ValueError if used with orient='columns'.
- Returns
- DataFrame
See also
DataFrame.from_records
DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrame
DataFrame object creation using constructor.
Examples
By default the keys of the dict become the DataFrame columns:
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
Specify orient='index' to create the DataFrame using dictionary keys as rows:
>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d
When using the ‘index’ orientation, the column names can be specified manually:
>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d
- classmethod from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)[source]
Convert structured or record ndarray to DataFrame.
Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.
- Parameters
- datastructured ndarray, sequence of tuples or dicts, or DataFrame
- indexstr, list of fields, array-like
- excludesequence, default None
- columnssequence, default None
- coerce_floatbool, default False
- nrowsint, default None
Structured input data.
Field of array to use as the index, alternately a specific set of input labels to use.
Columns or fields to exclude.
Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).
Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
Number of rows to read if data is an iterator.
- Returns
- DataFrame
See also
DataFrame.from_dict
DataFrame from dict of array-like or dicts.
DataFrame
DataFrame object creation using constructor.
Examples
Data can be provided as a structured ndarray:
>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
Data can be provided as a list of dicts:
>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
Data can be provided as a list of tuples with corresponding columns:
>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')]
>>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2'])
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
- ge(other, axis='columns', level=None)[source]
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
- otherscalar, sequence, Series, or DataFrame
- axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’
- levelint or label
Any single or multiple element data structure, or list-like object.
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
- DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- get(key, default=None)[source]
Get item from object for given key (ex: DataFrame column).
Returns default value if not found.
- Parameters
- keyobject
- Returns
- valuesame type as items contained in object
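The docs above give no example for get; a minimal sketch (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"temp": [20, 21], "wind": [5, 7]})

# An existing key returns that column; a missing key returns the default.
print(df.get("temp").tolist())         # [20, 21]
print(df.get("humidity", default=-1))  # -1
```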
- groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)[source]
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters
- bymapping, function, label, or list of labels
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
- levelint, level name, or sequence of such, default None
- as_indexbool, default True
- sortbool, default True
- group_keysbool, default True
- squeezebool, default False
- observedbool, default False
- dropnabool, default True
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see the .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
Split along rows (0) or columns (1).
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
When calling apply, add group keys to index to identify pieces.
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
If True, and if group keys contain NA values, NA values together with the row/column will be dropped. If False, NA values will also be treated as a key in groups.
New in version 1.1.0.
- Returns
- DataFrameGroupBy
Returns a groupby object that contains information about the groups.
See also
resample
Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
         Max Speed
Type
Captive      210.0
Wild         185.0
We can also choose to include NA in group keys or not by setting the dropna parameter; the default setting is True:
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5
>>> df.groupby(by=["b"], dropna=False).sum()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum()
      b      c
a
a  13.0   13.0
b  12.3  123.0
>>> df.groupby(by="a", dropna=False).sum()
        b      c
a
a    13.0   13.0
b    12.3  123.0
NaN  12.3   33.0
- gt(other, axis='columns', level=None)[source]
Get Greater than of dataframe and other, element-wise (binary operator gt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
- otherscalar, sequence, Series, or DataFrame
- axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’
- levelint or label
Any single or multiple element data structure, or list-like object.
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
- DataFrame of bool
Result of the comparison.
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- head(n=5)[source]
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
- Parameters
- nint, default 5
Number of rows to select.
- Returns
- same type as caller
The first n rows of the caller object.
See also
DataFrame.tail
Returns the last n rows.
Examples
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
Viewing the first 5 lines
>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
Viewing the first n lines (three in this case):
>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon
For negative values of n:
>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
- hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)[source]
Make a histogram of the DataFrame’s columns.
A histogram is a representation of the distribution of data. This function calls matplotlib.pyplot.hist() on each series in the DataFrame, resulting in one histogram per column.
- Parameters
- dataDataFrame
- columnstr or sequence, optional
- byobject, optional
- gridbool, default True
- xlabelsizeint, default None
- xrotfloat, default None
- ylabelsizeint, default None
- yrotfloat, default None
- axMatplotlib axes object, default None
- sharexbool, default True if ax is None else False
- shareybool, default False
- figsizetuple, optional
- layouttuple, optional
- binsint or sequence, default 10
- backendstr, default None
- legendbool, default False
- **kwargs
The pandas object holding the data.
If passed, will be used to limit data to a subset of columns.
If passed, then used to form histograms for separate groups.
Whether to show axis grid lines.
If specified changes the x-axis label size.
Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.
If specified changes the y-axis label size.
Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.
The axes to plot the histogram on.
In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.
In case subplots=True, share y axis and set some y axis labels to invisible.
The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.
Tuple of (rows, columns) for the layout of the histograms.
Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
Backend to use instead of the backend specified in the option plotting.backend. For instance, 'matplotlib'. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
Whether to show the legend.
New in version 1.1.0.
All other plotting keyword arguments to be passed to matplotlib.pyplot.hist().
- Returns
- matplotlib.AxesSubplot or numpy.ndarray of them
See also
matplotlib.pyplot.hist
Plot a histogram using matplotlib.
Examples
This example draws a histogram based on the length and width of some animals, displayed in three bins.
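The example code itself did not survive extraction; a sketch consistent with that description (the animal data is illustrative, not from the docs above):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import pandas as pd

# Illustrative lengths and widths of a few animals.
df = pd.DataFrame({
    "length": [1.5, 0.5, 1.2, 0.9, 3.0],
    "width": [0.7, 0.2, 0.15, 0.2, 1.1],
}, index=["pig", "rabbit", "duck", "chicken", "horse"])

# One histogram per column, each using three bins.
axes = df.hist(bins=3)
print(axes.shape)
```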
- property iat: pandas.core.indexing._iAtIndexer
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a single value in a DataFrame or Series.
- Raises
- IndexError
When integer position is out of bounds.
See also
DataFrame.at
Access a single value for a row/column label pair.
DataFrame.loc
Access a group of rows and columns by label(s).
DataFrame.iloc
Access a group of rows and columns by integer position(s).
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   columns=['A', 'B', 'C'])
>>> df
    A   B   C
0   0   2   3
1   0   4   1
2  10  20  30
Get value at specified row/column pair
>>> df.iat[1, 2]
1
Set value at specified row/column pair
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
Get value within a series
>>> df.loc[0].iat[1]
2
- idxmax(axis=0, skipna=True)[source]
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
- skipnabool, default True
The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- Returns
- Series
Indexes of maxima along the specified axis.
- Raises
- ValueError
If the row/column is empty
See alsoSeries.idxmax
Return index of the maximum element.
Notes
This method is the DataFrame version of ndarray.argmax.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax()
consumption     Wheat Products
co2_emissions             Beef
dtype: object
To return the index for the maximum value in each row, use
axis="columns"
.>>> df.idxmax(axis="columns") Pork co2_emissions Wheat Products consumption Beef co2_emissions dtype: object
- idxmin(axis=0, skipna=True)[source]
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
- Parameters
- axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- skipnabool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- Returns
- Series
Indexes of minima along the specified axis.
- Raises
- ValueError
If the row/column is empty.
See also
Series.idxmin
Return index of the minimum element.
Notes
This method is the DataFrame version of ndarray.argmin.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin()
consumption                Pork
co2_emissions    Wheat Products
dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns")
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
- property iloc: pandas.core.indexing._iLocIndexer
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also
DataFrame.iat
Fast integer location scalar accessor.
DataFrame.loc
Purely label-location based indexer for selection by label.
Series.iloc
Purely integer-location based indexing for selection by position.
Examples
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
>>> df = pd.DataFrame(mydict)
>>> df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
Indexing just the rows
With a scalar integer.
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64
With a list of integers.
>>> df.iloc[[0]]
   a  b  c  d
0  1  2  3  4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[[0, 1]]
     a    b    c    d
0    1    2    3    4
1  100  200  300  400
With a slice object.
>>> df.iloc[:3]
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
With a boolean mask the same length as the index.
>>> df.iloc[[True, False, True]]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being sliced. This selects the rows whose index labels are even.
>>> df.iloc[lambda x: x.index % 2 == 0]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000
Indexing both axes
You can mix the indexer types for the index and columns. Use : to select the entire axis.
With scalar integers.
>>> df.iloc[0, 1]
2
With lists of integers.
>>> df.iloc[[0, 2], [1, 3]]
      b     d
0     2     4
2  2000  4000
With slice objects.
>>> df.iloc[1:3, 0:3]
      a     b     c
1   100   200   300
2  1000  2000  3000
With a boolean array whose length matches the columns.
>>> df.iloc[:, [True, False, True, False]]
      a     c
0     1     3
1   100   300
2  1000  3000
With a callable function that expects the Series or DataFrame.
>>> df.iloc[:, lambda df: [0, 2]]
      a     c
0     1     3
1   100   300
2  1000  3000
- index: Index
The index (row labels) of the DataFrame.
- infer_objects()[source]
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.
- Returns
- convertedsame type as input object
See also
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to numeric type.
convert_dtypes
Convert argument to best possible dtype.
Examples
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]
>>> df
   A
1  1
2  2
3  3
>>> df.dtypes
A    object
dtype: object
>>> df.infer_objects().dtypes
A    int64
dtype: object
- info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None)[source]
Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
- Parameters
- dataDataFrame
DataFrame to print information about.
- verbosebool, optional
Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.
- bufwritable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
- max_colsint, optional
When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.
- memory_usagebool, str, optional
Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.
True always shows memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
- show_countsbool, optional
Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
- null_countsbool, optional
Deprecated since version 1.2.0: Use show_counts instead.
- Returns
- None
This method prints a summary of a DataFrame and returns None.
See also
DataFrame.describe
Generate descriptive statistics of DataFrame columns.
DataFrame.memory_usage
Memory usage of DataFrame columns.
Examples
>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                    "float_col": float_values})
>>> df
   int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00
Prints information of all columns:
>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Prints a summary of the column count and dtypes, but no per-column information:
>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content and write it to a text file:
>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:
...     f.write(s)
260
The memory_usage parameter allows deep introspection mode, especially useful for big DataFrames and fine-tuning memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 165.9 MB
- insert(loc, column, value, allow_duplicates=False)[source]
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- Parameters
- locint
Insertion index. Must verify 0 <= loc <= len(columns).
- columnstr, number, or hashable object
Label of the inserted column.
- valueint, Series, or array-like
- allow_duplicatesbool, optional
See also
Index.insert
Insert new item by index.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.insert(1, "newcol", [99, 99])
>>> df
   col1  newcol  col2
0     1      99     3
1     2      99     4
>>> df.insert(0, "col1", [100, 100], allow_duplicates=True)
>>> df
   col1  col1  newcol  col2
0   100     1      99     3
1   100     2      99     4
Notice that pandas uses index alignment in case of value from type Series:
>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2]))
>>> df
   col0  col1  col1  newcol  col2
0   NaN   100     1      99     3
1   5.0   100     2      99     4
- interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)[source]
Fill NaN values using an interpolation method.
Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.
- Parameters
- methodstr, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the ‘piecewise_polynomial’ interpolation method in scipy 0.18.
- axis{0 or ‘index’, 1 or ‘columns’, None}, default None
Axis to interpolate along.
- limitint, optional
Maximum number of consecutive NaNs to fill. Must be greater than 0.
- inplacebool, default False
Update the data in place if possible.
- limit_direction{‘forward’, ‘backward’, ‘both’}, optional
Consecutive NaNs will be filled in this direction.
- If limit is specified:
If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.
- If ‘limit’ is not specified:
If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’;
else the default is ‘forward’.
Changed in version 1.1.0: raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’; raises ValueError if limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’.
- limit_area{None, ‘inside’, ‘outside’}, default None
If limit is specified, consecutive NaNs will be filled with this restriction.
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values (interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
- downcastoptional, ‘infer’ or None, defaults to None
Downcast dtypes if possible.
- ``**kwargs``optional
Keyword arguments to pass on to the interpolating function.
- Returns
- Series or DataFrame or None
Returns the same object type as the caller, interpolated at some or all NaN values, or None if inplace=True.
See also
fillna
Fill missing values using different methods.
scipy.interpolate.Akima1DInterpolator
Piecewise cubic polynomials (Akima interpolator).
scipy.interpolate.BPoly.from_derivatives
Piecewise polynomial in the Bernstein basis.
scipy.interpolate.interp1d
Interpolate a 1-D function.
scipy.interpolate.KroghInterpolator
Interpolate polynomial (Krogh interpolator).
scipy.interpolate.PchipInterpolator
PCHIP 1-d monotonic cubic interpolation.
scipy.interpolate.CubicSpline
Cubic spline data interpolator.
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation and SciPy tutorial.
Examples
Filling in NaN in a Series via linear interpolation.
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
>>> s = pd.Series([np.nan, "single_one", np.nan,
...                "fill_two_more", np.nan, np.nan, np.nan,
...                4.71, np.nan])
>>> s
0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6              NaN
7             4.71
8              NaN
dtype: object
>>> s.interpolate(method='pad', limit=2)
0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6              NaN
7             4.71
8             4.71
dtype: object
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods require that you also specify an order (int).
>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it to use for interpolation.
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
Using polynomial interpolation.
>>> df['d'].interpolate(method='polynomial', order=2)
0     1.0
1     4.0
2     9.0
3    16.0
Name: d, dtype: float64
- isin(values)[source]
Whether each element in the DataFrame is contained in values.
- Parameters
- valuesiterable, Series, DataFrame or dict
The result will only be true at a location if all the labels match. If values is a Series, that is the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
- Returns
- DataFrame
DataFrame of booleans showing whether each element in the DataFrame is contained in values.
See also
DataFrame.eq
Equality test for DataFrame.
Series.isin
Equivalent method on Series.
Series.str.contains
Test if pattern or regex is contained within a string of a Series or Index.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings).
>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True
When values is a dict, we can pass values to check for each column separately:
>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True
When values is a Series or DataFrame, the index and column must match. Note that ‘dog’ matches nothing here, because there is no ‘dog’ row in other.
>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon      True       True
dog        False      False
- isna()[source]
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns
- DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
- isnull()[source]
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns
- DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
- items()[source]
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
- Yields
- labelobject
The column names for the DataFrame being iterated over.
- contentSeries
The column entries belonging to each label, as a Series.
See also
DataFrame.iterrows
Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples
Iterate over DataFrame rows as namedtuples of the values.
Examples
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df
         species  population
panda       bear        1864
polar       bear       22000
koala  marsupial       80000
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print(f'content: {content}', sep='\n')
...
label: species
content: panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content: panda     1864
polar    22000
koala    80000
Name: population, dtype: int64
- iteritems()[source]
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
- Yields
- labelobject
The column names for the DataFrame being iterated over.
- contentSeries
The column entries belonging to each label, as a Series.
See also
DataFrame.iterrows
Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples
Iterate over DataFrame rows as namedtuples of the values.
Examples
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df
         species  population
panda       bear        1864
polar       bear       22000
koala  marsupial       80000
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print(f'content: {content}', sep='\n')
...
label: species
content: panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content: panda     1864
polar    22000
koala    80000
Name: population, dtype: int64
- iterrows()[source]
Iterate over DataFrame rows as (index, Series) pairs.
- Yields
- indexlabel or tuple of label
The index of the row. A tuple for a MultiIndex.
- dataSeries
The data of the row as a Series.
See also
DataFrame.itertuples
Iterate over DataFrame rows as namedtuples of the values.
DataFrame.items
Iterate over (column name, Series) pairs.
Notes
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally faster than iterrows.
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
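As an added illustrative sketch of this caveat (not part of the original docstring): with mixed dtypes, the yielded row is backed by a copy, so writes to it never reach the DataFrame.

```python
import pandas as pd

# Mixed dtypes (int and object) force iterrows() to hand back copies.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
for _, row in df.iterrows():
    row['a'] = 0          # writes into a copy of the row, not into df
print(df['a'].tolist())   # the original values survive: [1, 2, 3]
```

The reverse can also bite: for some homogeneous dtypes the row may be a view, so relying on either behavior is fragile; mutate the DataFrame directly (e.g. via .loc) instead.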
- itertuples(index=True, name='Pandas')[source]
Iterate over DataFrame rows as namedtuples.
- Parameters
- indexbool, default True
If True, return the index as the first element of the tuple.
- namestr or None, default “Pandas”
The name of the returned namedtuples or None to return regular tuples.
- Returns
- iterator
An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.
See also
DataFrame.iterrows
Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.items
Iterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. On python versions < 3.7 regular tuples are returned for DataFrames with a large number of columns (>254).
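The renaming behavior described above can be sketched as follows (an added example, not from the original docstring): a column name containing a space is not a valid Python identifier, so it is replaced by a positional name.

```python
import pandas as pd

# 'num legs' is not a valid identifier, so itertuples() renames that
# field positionally (to _1, since the index occupies position 0).
df = pd.DataFrame({'num legs': [4, 2], 'num_wings': [0, 2]},
                  index=['dog', 'hawk'])
row = next(df.itertuples())
print(row._fields)   # ('Index', '_1', 'num_wings')
```

The valid identifier 'num_wings' is kept as-is; only the offending name is replaced.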
Examples
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)
By setting the index parameter to False, we can remove the index as the first element of the tuple:
>>> for row in df.itertuples(index=False):
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)
With the name parameter set, we set a custom name for the yielded namedtuples:
>>> for row in df.itertuples(name='Animal'):
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)
- join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)[source]
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
- Parameters
- otherDataFrame, Series, or list of DataFrame
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
- onstr, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
- how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other’s index.
outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling’s one.
- lsuffixstr, default ‘’
Suffix to use from left frame’s overlapping columns.
- rsuffixstr, default ‘’
Suffix to use from right frame’s overlapping columns.
- sortbool, default False
Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
- Returns
- DataFrame
A DataFrame containing columns from both the caller and other.
See also
DataFrame.merge
For column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added in version 0.23.0.
Examples
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})
>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
Join DataFrames using their indexes.
>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key'))
       A    B
key
K0    A0   B0
K1    A1   B1
K2    A2   B2
K3    A3  NaN
K4    A4  NaN
K5    A5  NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index, but we can use any column in df. This method preserves the original DataFrame’s index in the result.
>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
- keys()[source]
Get the ‘info axis’ (see Indexing for more).
This is index for Series, columns for DataFrame.
- Returns
- Index
Info axis.
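A short illustrative sketch (an added example, not part of the original docstring): for a DataFrame, keys() returns the columns; for a Series, the index.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(list(df.keys()))   # columns of the DataFrame: ['a', 'b']

s = pd.Series([10, 20], index=['x', 'y'])
print(list(s.keys()))    # index of the Series: ['x', 'y']
```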
- kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)[source]
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters
- axis{index (0), columns (1)}
Axis for the function to be applied on.
- skipnabool, default True
Exclude NA/null values when computing the result.
- levelint or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_onlybool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
- Returns
- Series or DataFrame (if level specified)
- kurtosis(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)[source]
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters
- axis{index (0), columns (1)}
Axis for the function to be applied on.
- skipnabool, default True
Exclude NA/null values when computing the result.
- levelint or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_onlybool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
- Returns
- Series or DataFrame (if level specified)
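A small illustrative sketch (an added example, not from the original docstring): five equally spaced values are flatter-tailed than a normal distribution (negative excess kurtosis), while a column dominated by one outlier is heavy-tailed (positive).

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 1, 1, 1, 10]})
# Column-wise unbiased (Fisher) excess kurtosis; normal data scores 0.0.
print(df.kurtosis())
# a   -1.2   (flatter than normal)
# b    5.0   (heavy-tailed, driven by the single outlier)
```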
- last(offset)[source]
Select final periods of time series data based on a date offset.
For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.
- Parameters
- offsetstr, DateOffset, dateutil.relativedelta
The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
- Returns
- Series or DataFrame
A subset of the caller.
- Raises
- TypeError
If the index is not a DatetimeIndex.
See also
first
Select initial periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the last 3 days:
>>> ts.last('3D')
            A
2018-04-13  3
2018-04-15  4
Notice that the data for the last 3 calendar days was returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
- last_valid_index()[source]
Return index for last non-NA value, or None if no non-NA value is found.
- Returns
- scalartype of index
Notes
If all elements are NA/null, returns None. Also returns None for an empty Series/DataFrame.
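A short illustrative sketch (an added example, not from the original docstring):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, np.nan, np.nan])
print(s.last_valid_index())                       # 1 (last label holding a non-NA value)
print(pd.Series([np.nan]).last_valid_index())     # None (no non-NA values at all)
print(pd.Series(dtype=float).last_valid_index())  # None (empty Series)
```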
- le(other, axis='columns', level=None)[source]
Get Less than or equal to of dataframe and other, element-wise (binary operator le).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, >, with support to choose axis (rows or columns) and level for comparison.
- Parameters
- otherscalar, sequence, Series, or DataFrame
Any single or multiple element data structure, or list-like object.
- axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’
Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
- levelint or label
Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
- DataFrame of bool
Result of the comparison.
See alsoDataFrame.eq
DataFrame.ne
DataFrame.le
DataFrame.lt
DataFrame.ge
DataFrame.gt
Compare DataFrames for equality elementwise.
Compare DataFrames for inequality elementwise.
Compare DataFrames for less than inequality or equality elementwise.
Compare DataFrames for strictly less than inequality elementwise.
Compare DataFrames for greater than inequality or equality elementwise.
Compare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together.
NaN
values are considered different (i.e.NaN
!=NaN
).Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When `other` is a `Series`, the columns of a DataFrame are aligned with the index of `other` and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in `other`:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
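The Notes above (mismatched indices are unioned, and `NaN` never compares equal to anything, including another `NaN`) can be sketched with two small, hypothetical frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, np.nan]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [1.0, np.nan]}, index=['b', 'c'])

# Alignment unions the indexes to ['a', 'b', 'c']; labels present on only
# one side are filled with NaN, and any comparison involving NaN is False.
result = df1.le(df2)
print(result)
# Every cell is False here: each aligned pair contains at least one NaN.
```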
- property loc: pandas.core.indexing._LocIndexer
Access a group of rows and columns by label(s) or a boolean array.
`.loc[]` is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
- A single label, e.g. `5` or `'a'` (note that `5` is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. `['a', 'b', 'c']`.
- A slice object with labels, e.g. `'a':'f'`. Warning: contrary to usual Python slices, both the start and the stop are included.
- A boolean array of the same length as the axis being sliced, e.g. `[True, False, True]`.
- An alignable boolean Series. The index of the key will be aligned before masking.
- An alignable Index. The Index of the returned selection will be the input.
- A `callable` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
See more at Selection by Label.
- Raises
- KeyError
If any items are not found.
- IndexingError
If an indexed key is passed and its index is unalignable to the frame index.
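A minimal sketch of the first failure mode listed above, using a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'max_speed': [1, 4, 7]},
                  index=['cobra', 'viper', 'sidewinder'])

# Requesting a label that does not exist raises KeyError rather than
# returning an empty selection.
try:
    df.loc['python']
except KeyError:
    print('label not found')
```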
See also
DataFrame.at : Access a single value for a row/column label pair.
DataFrame.iloc : Access group of rows and columns by integer position(s).
DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the Series/DataFrame.
Series.loc : Access group of values using labels.
Examples
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64
List of labels. Note using `[[]]` returns a DataFrame.
>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8
Single label for row and column
>>> df.loc['cobra', 'shield']
2
Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
Boolean list with the same length as the row axis
>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8
Alignable boolean Series:
>>> df.loc[pd.Series([False, True, False],
...                  index=['viper', 'sidewinder', 'cobra'])]
            max_speed  shield
sidewinder          7       8
Index (same behavior as `df.reindex`):
>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
       max_speed  shield
foo
cobra          1       2
viper          4       5
Conditional that returns a boolean Series
>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8
Conditional that returns a boolean Series with column labels specified
>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7
Callable that returns a boolean Series
>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8
Setting values
Set value for all items matching the list of labels
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50
Set value for an entire row
>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50
Set value for an entire column
>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50
Set value for rows matching callable condition
>>> df.loc[df['s