Pandas DataFrame Operations¶

In [1]:
import pandas as pd

Manage the online sales data¶

This dataset is an example of online sales data; each row is one transaction with the following columns:

Transaction ID: Unique identifier for each sales transaction.

Date:           Date of the sales transaction.

Product Category: Broad category of the product sold
                (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).

Product Name:   Specific name or model of the product sold.

Units Sold:     Number of units of the product sold in the transaction.

Unit Price:     Price of one unit of the product.

Total Revenue:  Total revenue generated from the sales transaction (Units Sold * Unit Price).

Region:         Geographic region where the transaction occurred (e.g., North America, Europe, Asia).

Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).

Data source: Kaggle Online Sales Dataset https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
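A hedged variant of the load in the next cell: parsing 'Date' while reading (an assumption that the column is ISO-formatted) makes it a real datetime column and allows a quick sanity check of the inferred types. The path is the same local path used below.

# Sketch only: same file, but with 'Date' parsed while reading
sales = pd.read_csv(
    "C:/Acqrote/DataFrame/OnlineSalesData.csv",   # local path, as in the next cell
    parse_dates=["Date"],
)
print(sales.dtypes)                               # confirm the inferred column types
print(sales["Date"].min(), "to", sales["Date"].max())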

In [2]:
df = pd.read_csv("C:/Acqrote/DataFrame/OnlineSalesData.csv")
In [3]:
df
Out[3]:
Transaction ID Date Product Category Product Name Units Sold Unit Price Total Revenue Region Payment Method
0 10001 2024-01-01 Electronics iPhone 14 Pro 2 999.99 1999.98 North America Credit Card
1 10002 2024-01-02 Home Appliances Dyson V11 Vacuum 1 499.99 499.99 Europe PayPal
2 10003 2024-01-03 Clothing Levi's 501 Jeans 3 69.99 209.97 Asia Debit Card
3 10004 2024-01-04 Books The Da Vinci Code 4 15.99 63.96 North America Credit Card
4 10005 2024-01-05 Beauty Products Neutrogena Skincare Set 1 89.99 89.99 Europe PayPal
... ... ... ... ... ... ... ... ... ...
235 10236 2024-08-23 Home Appliances Nespresso Vertuo Next Coffee and Espresso Maker 1 159.99 159.99 Europe PayPal
236 10237 2024-08-24 Clothing Nike Air Force 1 Sneakers 3 90.00 270.00 Asia Debit Card
237 10238 2024-08-25 Books The Handmaid's Tale by Margaret Atwood 3 10.99 32.97 North America Credit Card
238 10239 2024-08-26 Beauty Products Sunday Riley Luna Sleeping Night Oil 1 55.00 55.00 Europe PayPal
239 10240 2024-08-27 Sports Yeti Rambler 20 oz Tumbler 2 29.99 59.98 Asia Credit Card

240 rows × 9 columns

Get column names¶

In [4]:
df.columns
Out[4]:
Index(['Transaction ID', 'Date', 'Product Category', 'Product Name',
       'Units Sold', 'Unit Price', 'Total Revenue', 'Region',
       'Payment Method'],
      dtype='object')

Get DataFrame size¶

In [5]:
def getDataFrameSize(df: pd.DataFrame) -> list[int]:
    # Return [row count, column count]
    rows, cols = df.shape
    return [rows, cols]

getDataFrameSize(df)
Out[5]:
[240, 9]

Select a subset¶

Find the rows with product category 'Electronics'.

Display the last 6 rows of the subset ['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Date'].

Sort the rows by 'Date' in ascending order.

Check the accuracy of the column 'Total Revenue'¶

Count the number of mistakes, if any.

Report the counts of the result labels ('OK', 'Mistake') in a column named 'Result Label Count'.

In [6]:
def selectData(df: pd.DataFrame, colName:str, product:str, colList:list) -> pd.DataFrame:
    return df[df[colName]==product][colList].sort_values(by=['Date']).tail(6)

colList=['Product Category', 'Product Name', 'Unit Price','Units Sold', 'Date']

electronicsSales=selectData(df, 'Product Category', 'Electronics', colList)
electronicsSales
Out[6]:
Product Category Product Name Unit Price Units Sold Date
204 Electronics Samsung Galaxy Watch 4 299.99 1 2024-07-23
210 Electronics Canon EOS Rebel T7i DSLR Camera 749.99 1 2024-07-29
216 Electronics Apple MacBook Pro 16-inch 2399.00 1 2024-08-04
222 Electronics Amazon Echo Dot (4th Gen) 49.99 4 2024-08-10
228 Electronics Google Nest Wifi Router 169.00 1 2024-08-16
234 Electronics Bose QuietComfort 35 II Wireless Headphones 299.00 1 2024-08-22
In [7]:
def checkTotalRevenue(B: pd.DataFrame, colName:list) -> pd.DataFrame:
    A=B.copy()
    # Label each row by comparing Unit Price * Units Sold against the recorded Total Revenue
    A['Revenue Check']=(A['Unit Price']*A['Units Sold']-A['Total Revenue']).map(insertRevenue_label)
    
    return A[colName].groupby('Revenue Check').size().reset_index(name='Result Label Count')

def insertRevenue_label(delta):
    # Differences below one cent (in either direction) are treated as rounding noise
    if abs(delta)<0.01:
        return 'OK'
    else:
        return 'Mistake'
    
colName=['Revenue Check']


[TotalSalesQty, col]=getDataFrameSize(df)
print("Total Sales Lineitems:   ", TotalSalesQty)
checkTotalRevenue(df, colName)
Total Sales Lineitems:    240
Out[7]:
Revenue Check Result Label Count
0 OK 240
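The same check can be written with numpy.isclose, which handles the floating-point tolerance in a single vectorised call; a minimal sketch (numpy is not imported elsewhere in this notebook):

import numpy as np

# Sketch: flag rows where Unit Price * Units Sold and Total Revenue differ by more than one cent
mismatch = ~np.isclose(df['Unit Price'] * df['Units Sold'], df['Total Revenue'], atol=0.01)
print("Mistakes found:", int(mismatch.sum()))
# df.loc[mismatch] would list the offending rows, if any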

Drop duplicated transactions¶

Check the transactions and drop the duplicates

In [8]:
def dropDuplicateTransactions(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(keep='first', subset=['Transaction ID'])

[TransactionQtyBefore, col]=getDataFrameSize(df)

[TransactionQtyAfter, col]=getDataFrameSize(dropDuplicateTransactions(df))

print("Transaction quantities in 'dropDuplicateTransaction'\n", 
"[before: ", TransactionQtyBefore, "]   versus   [after: ", TransactionQtyAfter,"]")
Transaction quantities in 'dropDuplicateTransaction'
 [before:  240 ]   versus   [after:  240 ]
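Before dropping anything, the number of duplicated 'Transaction ID' values can be counted directly; a short sketch:

# Sketch: count Transaction IDs that occur more than once (0 for this dataset)
dup_mask = df.duplicated(subset=['Transaction ID'], keep='first')
print("Duplicated transactions:", int(dup_mask.sum()))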

Remove rows with missing values which cannot be recovered from other data¶

In [9]:
def dropMissingData(df: pd.DataFrame) -> pd.DataFrame:

    conditions = [
        df['Units Sold'].isna(),
        df['Total Revenue'].isna()
    ]

    # A row is unrecoverable only when both values are missing (logical AND);
    # if just one is missing it can be derived from the other plus 'Unit Price'
    final_mask = conditions[0] & conditions[1]

    return df[~final_mask]


[TransactionQtyBefore, col]=getDataFrameSize(df)

[TransactionQtyAfter, col]=getDataFrameSize(dropMissingData(df))

print("Transaction quantities in 'dropMissingData'\n",
    "[before: ", TransactionQtyBefore, "]   versus   [after: ", TransactionQtyAfter,"]")
Transaction quantities in 'dropMissingData'
 [before:  240 ]   versus   [after:  240 ]
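When only one of the two columns is missing, the value can usually be reconstructed before any row is dropped. A minimal sketch, assuming 'Unit Price' itself is always present:

def recoverMissingData(B: pd.DataFrame) -> pd.DataFrame:
    # Sketch: fill recoverable gaps from the other two columns
    A = B.copy()
    A['Units Sold'] = A['Units Sold'].fillna(A['Total Revenue'] / A['Unit Price'])
    A['Total Revenue'] = A['Total Revenue'].fillna(A['Units Sold'] * A['Unit Price'])
    return A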

Total Revenue per Product Category during the 'back-to-school event'¶

List the product categories and their revenue by 'Date' in the range from '2024-07-15' to '2024-08-15'.

Pivot a DataFrame¶

Pivots a DataFrame by spreading the values of a specific column across multiple columns, using values from another column as the new column headers.

In [10]:
def pivotTable(df: pd.DataFrame, start_date: str, end_date: str) -> pd.DataFrame:
    productDF=df.pivot(index='Date', columns='Product Category', values='Total Revenue')
    
    # Label slicing works because the ISO-formatted date strings sort chronologically
    return productDF[start_date:end_date]

BacktoSchoolDF=pivotTable(df, '2024-07-15', '2024-08-15')
BacktoSchoolDF
Out[10]:
Product Category Beauty Products Books Clothing Electronics Home Appliances Sports
Date
2024-07-15 125.0 NaN NaN NaN NaN NaN
2024-07-16 NaN NaN NaN NaN NaN 449.99
2024-07-17 NaN NaN NaN 358.00 NaN NaN
2024-07-18 NaN NaN NaN NaN 99.95 NaN
2024-07-19 NaN NaN 179.97 NaN NaN NaN
2024-07-20 NaN 29.98 NaN NaN NaN NaN
2024-07-21 52.0 NaN NaN NaN NaN NaN
2024-07-22 NaN NaN NaN NaN NaN 399.99
2024-07-23 NaN NaN NaN 299.99 NaN NaN
2024-07-24 NaN NaN NaN NaN 379.99 NaN
2024-07-25 NaN NaN 196.00 NaN NaN NaN
2024-07-26 NaN 50.97 NaN NaN NaN NaN
2024-07-27 79.0 NaN NaN NaN NaN NaN
2024-07-28 NaN NaN NaN NaN NaN 129.00
2024-07-29 NaN NaN NaN 749.99 NaN NaN
2024-07-30 NaN NaN NaN NaN 339.98 NaN
2024-07-31 NaN NaN 39.60 NaN NaN NaN
2024-08-01 NaN 32.97 NaN NaN NaN NaN
2024-08-02 58.0 NaN NaN NaN NaN NaN
2024-08-03 NaN NaN NaN NaN NaN 349.99
2024-08-04 NaN NaN NaN 2399.00 NaN NaN
2024-08-05 NaN NaN NaN NaN 449.99 NaN
2024-08-06 NaN NaN 149.97 NaN NaN NaN
2024-08-07 NaN 25.98 NaN NaN NaN NaN
2024-08-08 27.0 NaN NaN NaN NaN NaN
2024-08-09 NaN NaN NaN NaN NaN 599.99
2024-08-10 NaN NaN NaN 199.96 NaN NaN
2024-08-11 NaN NaN NaN NaN 459.98 NaN
2024-08-12 NaN NaN 89.98 NaN NaN NaN
2024-08-13 NaN 80.97 NaN NaN NaN NaN
2024-08-14 6.7 NaN NaN NaN NaN NaN
2024-08-15 NaN NaN NaN NaN NaN 299.90
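Note that DataFrame.pivot raises an error if the same Date / Product Category pair occurs more than once. A hedged alternative for that case is pivot_table, which aggregates duplicates; the sketch below assumes summing them is acceptable:

def pivotTableSum(df: pd.DataFrame, start_date: str, end_date: str) -> pd.DataFrame:
    # Sketch: same reshaping as above, but tolerant of repeated Date/Category pairs
    productDF = df.pivot_table(index='Date', columns='Product Category',
                               values='Total Revenue', aggfunc='sum')
    return productDF[start_date:end_date]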

Compress the 'Product Category' columns back into one column and list 'Total Revenue' in one column¶

For each product category, list its revenue at each date.

Use the DataFrame 'BacktoSchoolDF'. The output should match the format of the original df.

Melt a DataFrame¶

Melts (unpivots) a DataFrame by gathering multiple columns into two new columns:

one for the variable names (column headers) and another for their corresponding values.

In [11]:
def meltTable(df: pd.DataFrame) -> pd.DataFrame:
    df = df.reset_index()   # make 'Date' a regular column (not the index) without mutating the caller's DataFrame
    meltProduct=df.melt(id_vars='Date', var_name='Product Category', value_name='Total Revenue')
    return meltProduct.dropna(subset = ['Total Revenue']).sort_values(by=['Date']).reset_index(drop=True)

BacktoSchoolProducts=meltTable(BacktoSchoolDF)
BacktoSchoolProducts
Out[11]:
Date Product Category Total Revenue
0 2024-07-15 Beauty Products 125.00
1 2024-07-16 Sports 449.99
2 2024-07-17 Electronics 358.00
3 2024-07-18 Home Appliances 99.95
4 2024-07-19 Clothing 179.97
5 2024-07-20 Books 29.98
6 2024-07-21 Beauty Products 52.00
7 2024-07-22 Sports 399.99
8 2024-07-23 Electronics 299.99
9 2024-07-24 Home Appliances 379.99
10 2024-07-25 Clothing 196.00
11 2024-07-26 Books 50.97
12 2024-07-27 Beauty Products 79.00
13 2024-07-28 Sports 129.00
14 2024-07-29 Electronics 749.99
15 2024-07-30 Home Appliances 339.98
16 2024-07-31 Clothing 39.60
17 2024-08-01 Books 32.97
18 2024-08-02 Beauty Products 58.00
19 2024-08-03 Sports 349.99
20 2024-08-04 Electronics 2399.00
21 2024-08-05 Home Appliances 449.99
22 2024-08-06 Clothing 149.97
23 2024-08-07 Books 25.98
24 2024-08-08 Beauty Products 27.00
25 2024-08-09 Sports 599.99
26 2024-08-10 Electronics 199.96
27 2024-08-11 Home Appliances 459.98
28 2024-08-12 Clothing 89.98
29 2024-08-13 Books 80.97
30 2024-08-14 Beauty Products 6.70
31 2024-08-15 Sports 299.90

Count 'Units Sold' by Price Group¶

Cut 'Unit Price' into three groups based on the inputs 'Low Price', 'Medium Price' and 'High Price'. Display the sum of 'Units Sold' in each group.

In [12]:
def aggregateUnitPrice(B:pd.DataFrame, lowPrice:float, medPrice:float, highPrice:float)->pd.DataFrame:
    A=B.copy()
    # Define the bins: (0, lowPrice], (lowPrice, medPrice], (medPrice, highPrice]
    bins = [0, lowPrice, medPrice, highPrice]
    labels = ['Low Price (<=$'+str(lowPrice)+')',
              'Medium Price ($'+str(lowPrice)+'-$'+str(medPrice)+')',
              'High Price ($'+str(medPrice)+'-$'+str(highPrice)+')']

    # Create a new column with the group labels
    A['Price Group'] = pd.cut(A['Unit Price'], bins=bins, labels=labels)
    
    # Aggregate the values based on the groups
    grouped = A.groupby('Price Group')[['Units Sold']].sum()
    
    return grouped

aggregateUnitPrice(df, 1750.00, 3500.00, 5000.00)
Out[12]:
Units Sold
Price Group
Low Price (<=$1750.0) 514
Medium Price ($1750.0-$3500.0) 3
High Price ($3500.0-$5000.0) 1
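One caveat of pd.cut with a fixed upper bound: any 'Unit Price' above highPrice falls outside the bins and is silently excluded from the counts. A sketch of a variant with an open-ended top bin (np.inf is an assumption, not part of the original function):

import numpy as np

def aggregateUnitPriceOpenTop(B: pd.DataFrame, lowPrice: float, medPrice: float) -> pd.DataFrame:
    # Sketch: the top bin is unbounded, so no unit price is dropped
    A = B.copy()
    bins = [0, lowPrice, medPrice, np.inf]
    labels = ['Low Price (<=$'+str(lowPrice)+')',
              'Medium Price ($'+str(lowPrice)+'-$'+str(medPrice)+')',
              'High Price (>$'+str(medPrice)+')']
    A['Price Group'] = pd.cut(A['Unit Price'], bins=bins, labels=labels)
    return A.groupby('Price Group')[['Units Sold']].sum()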

Find products which were sold in Europe and paid for via PayPal¶

Group the found products by 'Product Category' and sum the total 'Units Sold'.

In [13]:
def region_payment(df: pd.DataFrame, region:str, payment:str):
    regionSales= df.loc[((df['Region'] ==region) & (df['Payment Method']==payment)), \
    ['Product Category', 'Units Sold', 'Region', 'Payment Method']]
    
    display=regionSales.groupby(['Product Category', 'Region', 'Payment Method'])[['Units Sold']].sum()
    return display

region_payment(df, 'Europe', 'PayPal')
        
Out[13]:
Units Sold
Product Category Region Payment Method
Beauty Products Europe PayPal 46
Home Appliances Europe PayPal 59

Find lengthy product names¶

Find the product names that have more than a specified number of characters.

In [14]:
def invalid_productnames(df: pd.DataFrame, charlen:int) -> pd.DataFrame:
    return df.loc[df['Product Name'].str.len()>charlen, ['Transaction ID','Product Name']]
    
invalid_productnames(df, 50)
Out[14]:
Transaction ID Product Name
111 10112 Sapiens: A Brief History of Humankind by Yuval...
142 10143 Clinique Dramatically Different Moisturizing L...
172 10173 Paula's Choice Skin Perfecting 2% BHA Liquid E...
202 10203 Dr. Jart+ Cicapair Tiger Grass Color Correctin...

Add a column to display 'Total Revenue' for Europe in euros¶

Display the six highest revenues in Europe.

In [15]:
def checkEuropeRegion(region:str, revenue:float, exchangeRate:float):
    # Convert the revenue of European rows to euros; other regions get a placeholder
    if region=='Europe':
        return revenue*exchangeRate
    else:
        return 'N/A'

def exchangeCurrency(B: pd.DataFrame)-> pd.DataFrame:
    A=B.copy()
    # Row-wise conversion at a fixed rate of 0.97
    A['Total Revenue (Euro)']=A.apply(lambda x: checkEuropeRegion(region=x['Region'], revenue=x['Total Revenue'], exchangeRate=0.97), axis=1)
    result=A[['Region', 'Total Revenue', 'Total Revenue (Euro)']]
    return result[result['Region']=='Europe'].sort_values('Total Revenue (Euro)', ascending=False).head(6)

exchangeCurrency(df)
    
Out[15]:
Region Total Revenue Total Revenue (Euro)
85 Europe 2599.98 2521.9806
37 Europe 1599.98 1551.9806
7 Europe 1199.98 1163.9806
73 Europe 899.99 872.9903
127 Europe 799.98 775.9806
103 Europe 699.98 678.9806
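The row-wise apply above can also be written vectorised; a minimal sketch using numpy.where (the 0.97 rate is the same assumption as above, and non-European rows get NaN instead of 'N/A'):

import numpy as np

def exchangeCurrencyVectorised(B: pd.DataFrame, exchangeRate: float = 0.97) -> pd.DataFrame:
    # Sketch: compute euros only for European rows, NaN elsewhere
    A = B.copy()
    A['Total Revenue (Euro)'] = np.where(A['Region'] == 'Europe',
                                         A['Total Revenue'] * exchangeRate,
                                         np.nan)
    result = A.loc[A['Region'] == 'Europe', ['Region', 'Total Revenue', 'Total Revenue (Euro)']]
    return result.sort_values('Total Revenue (Euro)', ascending=False).head(6)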

Find the transactions with the highest revenue in each product category¶

In [16]:
def maxRevenue_ProductCategory(df:pd.DataFrame)-> pd.DataFrame:
    highest_revenue = df.groupby('Product Category').apply(lambda x: x[x['Total Revenue'] == x['Total Revenue'].max()])
    return highest_revenue[['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Total Revenue', 'Region']].sort_values('Total Revenue', ascending=False)

maxRevenue_ProductCategory(df)
Out[16]:
Product Category Product Name Unit Price Units Sold Total Revenue Region
Product Category
Electronics 102 Electronics Canon EOS R5 Camera 3899.99 1 3899.99 North America
Home Appliances 85 Home Appliances LG OLED TV 1299.99 2 2599.98 Europe
Sports 35 Sports Peloton Bike 1895.00 1 1895.00 Asia
Clothing 8 Clothing Nike Air Force 1 89.99 6 539.94 Asia
Beauty Products 16 Beauty Products Dyson Supersonic Hair Dryer 399.99 1 399.99 Europe
Books 45 Books Becoming by Michelle Obama 32.50 4 130.00 North America
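An alternative sketch with idxmax returns exactly one row per category (the first row when several transactions tie at the maximum) and avoids the groupby/apply round trip:

def maxRevenuePerCategory_idxmax(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch: pick the index of the largest revenue in each category, then look up those rows
    idx = df.groupby('Product Category')['Total Revenue'].idxmax()
    cols = ['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Total Revenue', 'Region']
    return df.loc[idx, cols].sort_values('Total Revenue', ascending=False)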

Rank the total revenue by product category for a specific payment method¶

Allow 'Payment Method' and 'Product Category' as inputs.

Display the top 10 results of ['Payment Method', 'Total Revenue', 'Revenue Rank', 'Product Category', 'Region'],
sorted in ascending order of 'Revenue Rank'.

In [18]:
def rankRevenue_PaymentMethod(B:pd.DataFrame, paymentMethod:str, productCat:str)->pd.DataFrame:
    A=B.copy()
    A['Revenue Rank']= A.groupby(['Payment Method', 'Product Category'])['Total Revenue'].rank(method='dense', ascending=False)
    result=A[['Payment Method', 'Total Revenue', 'Revenue Rank', 'Product Category', 'Region']].sort_values(by=['Payment Method', 'Revenue Rank'], ascending=True)
    return result[(result['Payment Method']==paymentMethod) & (result['Product Category']==productCat)].head(10)
    
rankRevenue_PaymentMethod(df, 'PayPal', 'Home Appliances')
Out[18]:
Payment Method Total Revenue Revenue Rank Product Category Region
85 PayPal 2599.98 1.0 Home Appliances Europe
37 PayPal 1599.98 2.0 Home Appliances Europe
7 PayPal 1199.98 3.0 Home Appliances Europe
73 PayPal 899.99 4.0 Home Appliances Europe
127 PayPal 799.98 5.0 Home Appliances Europe
103 PayPal 699.98 6.0 Home Appliances Europe
67 PayPal 659.97 7.0 Home Appliances Europe
115 PayPal 599.90 8.0 Home Appliances Europe
229 PayPal 599.00 9.0 Home Appliances Europe
1 PayPal 499.99 10.0 Home Appliances Europe

Find the top units sold per day in each product category¶

Sort 'Units Sold' in descending order. Allow the user to specify how many of the most-sold rows per product category to display.
The columns to be displayed are ['Product Category (Top 'qty' groups)', 'Units Sold', 'Region', 'Payment Method'].

In [20]:
def topProductSold(df:pd.DataFrame, qty:int)->pd.DataFrame:
    top = df.sort_values(['Units Sold', 'Product Category'], ascending=False).groupby('Product Category').head(qty).reset_index(drop=True)
    
    result=top [['Product Category', 'Units Sold', 'Region', 'Payment Method']]
    result=result.rename(columns={'Product Category': 'Product Category (Top '+str(qty)+' groups)'})
    return result
topProductSold(df, 2)
Out[20]:
Product Category (Top 2 groups) Units Sold Region Payment Method
0 Clothing 10 Asia Debit Card
1 Sports 6 Asia Credit Card
2 Sports 6 Asia Credit Card
3 Clothing 6 Asia Debit Card
4 Electronics 4 North America Credit Card
5 Electronics 4 North America Credit Card
6 Books 4 North America Credit Card
7 Books 4 North America Credit Card
8 Home Appliances 3 Europe PayPal
9 Home Appliances 3 Europe PayPal
10 Beauty Products 2 Europe PayPal
11 Beauty Products 2 Europe PayPal

Count the number of unique product categories in each region¶

In [22]:
def count_unique_categories(df: pd.DataFrame) -> pd.DataFrame:
    result = df.groupby(['Region'])['Product Category'].nunique().reset_index()
    
    result = result.rename(columns={'Product Category':'Unique Category Count'})
    return result

count_unique_categories(df)
    
Out[22]:
Region Unique Category Count
0 Asia 2
1 Europe 2
2 North America 2
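To see which two categories those are in each region, a crosstab of region against category can be sketched; each non-zero cell is a category that was sold in that region:

# Sketch: transaction counts per Region x Product Category
pd.crosstab(df['Region'], df['Product Category'])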

Find the product categories whose average revenue exceeds a specified value¶

List the result for one specific region, displaying the columns ['Region', 'Product Category', 'Product Name', 'Total Revenue'].
Sort the result by 'Product Category' in ascending order.

In [24]:
def filteredAveRevenueRegion(df:pd.DataFrame, region:str, average:float)->pd.DataFrame:
    
    # Keep only the product categories whose mean 'Total Revenue' exceeds the threshold
    result=df.groupby('Product Category').filter(lambda x: x['Total Revenue'].mean()>average) \
        .sort_values(by='Product Category')[['Region', 'Product Category', 'Product Name', 'Total Revenue']]
    return result[result['Region']==region]

filteredAveRevenueRegion(df, 'Asia', 210.00)
Out[24]:
Region Product Category Product Name Total Revenue
119 Asia Sports YETI Hopper Flip Portable Cooler 249.99
77 Asia Sports Rogue Fitness Kettlebell 209.97
83 Asia Sports Spalding NBA Street Basketball 149.94
89 Asia Sports On Running Cloud Shoes 259.98
95 Asia Sports Garmin Fenix 6X Pro 999.99
101 Asia Sports Under Armour HOVR Sonic 4 Shoes 219.98
107 Asia Sports Bowflex SelectTech 552 Dumbbells 399.99
113 Asia Sports Fitbit Versa 3 689.85
125 Asia Sports Yeti Roadie 24 Cooler 199.99
131 Asia Sports Hydro Flask Wide Mouth Water Bottle 159.80
137 Asia Sports Fitbit Inspire 2 199.90
143 Asia Sports YETI Tundra 45 Cooler 299.99
149 Asia Sports Bose SoundSport Wireless Earbuds 299.98
71 Asia Sports Nike Metcon 6 389.97
155 Asia Sports Garmin Forerunner 245 299.99
167 Asia Sports Yeti Rambler Bottle 149.97
173 Asia Sports Bowflex SelectTech 1090 Adjustable Dumbbells 699.99
179 Asia Sports Oakley Holbrook Sunglasses 146.00
185 Asia Sports Polar Vantage V2 499.95
191 Asia Sports TRX All-in-One Suspension Training System 169.95
197 Asia Sports GoPro HERO9 Black 449.99
203 Asia Sports Yeti Tundra Haul Portable Wheeled Cooler 399.99
209 Asia Sports Bose SoundLink Color Bluetooth Speaker II 129.00
215 Asia Sports YETI Tundra 65 Cooler 349.99
221 Asia Sports Garmin Forerunner 945 599.99
227 Asia Sports Fitbit Luxe 299.90
233 Asia Sports Hydro Flask Standard Mouth Water Bottle 98.85
161 Asia Sports TriggerPoint GRID Foam Roller 69.98
65 Asia Sports Garmin Edge 530 599.98
239 Asia Sports Yeti Rambler 20 oz Tumbler 59.98
35 Asia Sports Peloton Bike 1895.00
17 Asia Sports Manduka PRO Yoga Mat 479.96
11 Asia Sports Babolat Pure Drive Tennis Racket 599.97
23 Asia Sports Adidas FIFA World Cup Football 89.97
47 Asia Sports Yeti Rambler Tumbler 239.94
5 Asia Sports Wilson Evolution Basketball 149.95
53 Asia Sports Titleist Pro V1 Golf Balls 249.95
41 Asia Sports Fitbit Charge 5 259.98
59 Asia Sports Hyperice Hypervolt Massager 349.00
29 Asia Sports Nike Air Zoom Pegasus 37 259.98
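The same filter can also be expressed with transform, which keeps the whole operation vectorised; a minimal sketch:

def filteredAveRevenueRegion_transform(df: pd.DataFrame, region: str, average: float) -> pd.DataFrame:
    # Sketch: keep rows whose category-level mean revenue exceeds the threshold
    mask = df.groupby('Product Category')['Total Revenue'].transform('mean') > average
    cols = ['Region', 'Product Category', 'Product Name', 'Total Revenue']
    return df.loc[mask & (df['Region'] == region), cols].sort_values('Product Category')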

Find the most expensive product in each product category¶

In [26]:
def mostExpensiveProduct(df: pd.DataFrame):
    result= df.groupby(['Product Category' ])[['Unit Price']].max()
    result.rename(columns={'Unit Price':'Max Unit Price'}, inplace=True)
   
    return result

mostExpensiveProduct(df)
Out[26]:
Max Unit Price
Product Category
Beauty Products 399.99
Books 35.99
Clothing 249.99
Electronics 3899.99
Home Appliances 1299.99
Sports 1895.00
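The table above reports only the maximum unit price per category; a hedged sketch that also pulls the matching 'Product Name' via idxmax (when several products share the maximum price, only the first is returned):

def mostExpensiveProductName(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch: one row per category, showing the product at the maximum unit price
    idx = df.groupby('Product Category')['Unit Price'].idxmax()
    result = df.loc[idx, ['Product Category', 'Product Name', 'Unit Price']]
    return result.rename(columns={'Unit Price': 'Max Unit Price'}).set_index('Product Category')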

Aggregate by 'Product Category': display the revenue sum and the dates in each group¶

In [28]:
def aggProductbyRevenueDates(df:pd.DataFrame)->pd.DataFrame:

    return df.groupby('Product Category').agg(
        Revenue=('Total Revenue', 'sum'),
        Dates=('Date', lambda x: ',   '.join(sorted(x.unique())))
    ).reset_index()

aggProductbyRevenueDates(df) 
Out[28]:
Product Category Revenue Dates
0 Beauty Products 2621.90 2024-01-05, 2024-01-11, 2024-01-17, 2024...
1 Books 1861.93 2024-01-04, 2024-01-10, 2024-01-16, 2024...
2 Clothing 8128.93 2024-01-03, 2024-01-09, 2024-01-15, 2024...
3 Electronics 34982.41 2024-01-01, 2024-01-07, 2024-01-13, 2024...
4 Home Appliances 18646.16 2024-01-02, 2024-01-08, 2024-01-14, 2024...
5 Sports 14326.52 2024-01-06, 2024-01-12, 2024-01-18, 2024...

Divide the dataframe into two subsets: Product Catalog and Product Sales¶

In [30]:
ProductCatalog=df[['Product Category', 'Product Name', 'Unit Price']]
ProductSales=df[['Transaction ID', 'Date', 'Product Name', 'Units Sold', 'Region', 'Payment Method']]

Combine two dataframes into one dataframe and add a column 'Total Revenue'¶

In [32]:
def mergeTwoTables(dfa:pd.DataFrame, dfb:pd.DataFrame)->pd.DataFrame:
    colName=['Transaction ID', 'Date', 'Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Region', 'Payment Method']
    # Join on the shared 'Product Name' column, then keep one row per transaction
    result=pd.merge(dfa, dfb, on='Product Name', how='inner')[colName].drop_duplicates(keep='first', subset='Transaction ID')
    result['Total Revenue']=result['Unit Price']*result['Units Sold']
    return result

print('Shape of First Table: ', ProductCatalog.shape)
print('Shape of Second Table: ', ProductSales.shape)

finalTable=mergeTwoTables(ProductSales, ProductCatalog)
print(finalTable.shape)
finalTable.head(6)
Shape of First Table:  (240, 3)
Shape of Second Table:  (240, 6)
(240, 9)
Out[32]:
Transaction ID Date Product Category Product Name Unit Price Units Sold Region Payment Method Total Revenue
0 10001 2024-01-01 Electronics iPhone 14 Pro 999.99 2 North America Credit Card 1999.98
1 10002 2024-01-02 Home Appliances Dyson V11 Vacuum 499.99 1 Europe PayPal 499.99
2 10003 2024-01-03 Clothing Levi's 501 Jeans 69.99 3 Asia Debit Card 209.97
3 10004 2024-01-04 Books The Da Vinci Code 15.99 4 North America Credit Card 63.96
4 10005 2024-01-05 Beauty Products Neutrogena Skincare Set 89.99 1 Europe PayPal 89.99
5 10006 2024-01-06 Sports Wilson Evolution Basketball 29.99 5 Asia Credit Card 149.95
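Because 'ProductCatalog' keeps one row per original transaction, a product sold several times appears several times in the catalog, and the inner join briefly multiplies those rows before drop_duplicates trims them back. A sketch that deduplicates the catalog first and asserts a many-to-one join, assuming each product name maps to a single unit price:

def mergeTwoTablesClean(sales: pd.DataFrame, catalog: pd.DataFrame) -> pd.DataFrame:
    # Sketch: one catalog row per product, validated many-to-one join
    catalog_unique = catalog.drop_duplicates(subset='Product Name')
    result = sales.merge(catalog_unique, on='Product Name', how='left', validate='many_to_one')
    result['Total Revenue'] = result['Unit Price'] * result['Units Sold']
    return result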

Show the DataFrame which excludes the back-to-school items¶

In [34]:
def ExcludeItems(dfA:pd.DataFrame, dfB:pd.DataFrame, colName:str )->pd.DataFrame:
    # reset_index() ensures the key is a column even when it lives in dfB's index (as 'Date' does in BacktoSchoolDF)
    return dfA[~dfA[colName].isin(dfB.reset_index()[colName])].reset_index()

ExcludeItems(df, BacktoSchoolDF, 'Date')      
Out[34]:
index Transaction ID Date Product Category Product Name Units Sold Unit Price Total Revenue Region Payment Method
0 0 10001 2024-01-01 Electronics iPhone 14 Pro 2 999.99 1999.98 North America Credit Card
1 1 10002 2024-01-02 Home Appliances Dyson V11 Vacuum 1 499.99 499.99 Europe PayPal
2 2 10003 2024-01-03 Clothing Levi's 501 Jeans 3 69.99 209.97 Asia Debit Card
3 3 10004 2024-01-04 Books The Da Vinci Code 4 15.99 63.96 North America Credit Card
4 4 10005 2024-01-05 Beauty Products Neutrogena Skincare Set 1 89.99 89.99 Europe PayPal
... ... ... ... ... ... ... ... ... ... ...
203 235 10236 2024-08-23 Home Appliances Nespresso Vertuo Next Coffee and Espresso Maker 1 159.99 159.99 Europe PayPal
204 236 10237 2024-08-24 Clothing Nike Air Force 1 Sneakers 3 90.00 270.00 Asia Debit Card
205 237 10238 2024-08-25 Books The Handmaid's Tale by Margaret Atwood 3 10.99 32.97 North America Credit Card
206 238 10239 2024-08-26 Beauty Products Sunday Riley Luna Sleeping Night Oil 1 55.00 55.00 Europe PayPal
207 239 10240 2024-08-27 Sports Yeti Rambler 20 oz Tumbler 2 29.99 59.98 Asia Credit Card

208 rows × 10 columns
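The same exclusion can be written as an anti-join using the merge indicator; a minimal sketch (reset_index on dfB again copes with 'Date' living in the index):

def ExcludeItemsAntiJoin(dfA: pd.DataFrame, dfB: pd.DataFrame, colName: str) -> pd.DataFrame:
    # Sketch: left-merge on the key and keep only the rows that found no match
    right = dfB.reset_index()[[colName]].drop_duplicates()
    merged = dfA.merge(right, on=colName, how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')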

In [35]:
print(df.columns)
print(df.shape)
Index(['Transaction ID', 'Date', 'Product Category', 'Product Name',
       'Units Sold', 'Unit Price', 'Total Revenue', 'Region',
       'Payment Method'],
      dtype='object')
(240, 9)

End of File¶
