A DataFrame is a data type from pandas module. It represents spreadsheet-like data (rows and columns).
You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.
DataFrames can contain different data types: strings, ints, floats, tuples, etc.
Components of a DataFrame DataFrame objects consist of three components, stored as attributes:
df.columns
df.index
You can create DataFrame by passing in a dictionary or a list of lists to the pd.DataFrame()
method, or by reading data from a CSV file pd.read_csv
Passing in a dictionary to pd.DataFrame()
list_of_dicts = [
{"name": "Anna", "height_cm": 110, "weight_kg": 2, "date_of_birth": "2019-03-14"},
{"name": "Jane", "height_cm": 121, "weight_kg": 2, "date_of_birth": "2019-05-09"}
]
new_babies = pd.DataFrame(dict_of_lists)
We can also create DataFrame this way:
import pandas as pd
df1 = pd.DataFrame({
'Product ID': [1, 2, 3, 4],
'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
'Color': ['blue', 'green', 'red', 'black']
})
Note: the columns will appear in alphabetical order because dictionaries don’t have any inherent order for columns.
When using this method, we need to go through the data column by column.
dict_of_lists = {
"name": ["Anna", "Jane"],
"height_cm": [110, 121],
"weight_kg": [2, 3],
"date_of_birth": ["2019-03-14", "2019-05-09"]
}
new_babies = pd.DataFrame(dict_of_lists)
We can also pass in data this way:
import pandas as pd
df2 = pd.DataFrame([
[1, 'San Diego', 100],
[2, 'Los Angeles', 120],
[3, 'San Francisco', 90],
[4, 'Sacramento', 115]
],
columns = [
'Store ID', 'Location', 'Number of Employees'
])
print(df2)
If it’s a small DataFrame, you can display it with print(df)
.
However, with a larger one, it’s better to display certain items without looking at the entire DataFrame.
.head()
method selects the first 5 rows of a DataFrame:
df.head()
print(df.head())
If you want to see more rows, add the number. For example, df.head(8)
shows the first 8 rows.
describe()
method display the summary statistics of the table, like mean and median.
df.describe()
The method info()
displays a summary of the table, such as the data type and number of missing values.
The result includes all columns and their data types.
df.info()
It’s useful for DataFrame with many columns that are difficult to display using .head()
.columns
display the column names of the table
.value_counts() displays the distinct values for a column
.shape
attribute contains a tuple that holds the number of rows followed by the number of columns.
df.shape
# Output: (4, 6)
To check the types of each column of a DataFrame, we can use:
print(df.dtypes)
To merge 2 or more DataFrames with matching rows, we use .merge()
pd.merge(df1, df2)
For example, to merge orders with customers, you could use:
new_df = pd.merge(orders, customers)
df1.merge(df2)
new_df = orders.merge(customers)
Outer Join would include all rows from both tables, even if they don’t match.
pd.merge(df1, df2, how='outer')
For example, company A has customer’s name and email. Company B has customer’s name and phone number.
We merge 2 companies’ data with:
data = pd.merge(company_a, company_b, how='outer')
print(data)
Any missing values are filled in with None or nan (which stands for “Not a Number”).
name | phone | |
---|---|---|
Hanna Jone | hjone@gmail.com | nan |
John Smith | jsmith@yahoo.com | 023-012 |
Jane May | jmay@gmail.com | 453-103 |
Niles Arr | nan | 467-305 |
3.1. Left merge includes all rows from the first (left) table, but only rows from the second (right) table that match the first table.
pd.merge(df1, df2, how='left')
3.2. Right merge includes all rows from the second (right) table, but only rows from the first (left) table that match the second table.
pd.merge(df1, df2, how='right')
To reconstruct a single DataFrame from multiple smaller DataFrames, we can use:
pd.concat([df1, df2, df2, ...])
This method only works if all DataFrames have the same numbers of columns.