import pandas as pd
This dataset is an example of online sales data. Each row describes one transaction with the following fields:
Transaction ID: Unique identifier for each sales transaction.
Date: Date of the sales transaction.
Product Category: Broad category of the product sold
(e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
Product Name: Specific name or model of the product sold.
Units Sold: Number of units of the product sold in the transaction.
Unit Price: Price of one unit of the product.
Total Revenue: Total revenue generated from the sales transaction (Units Sold × Unit Price).
Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).
Data source: Kaggle Online Sales Dataset https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
df = pd.read_csv("C:/Acqrote/DataFrame/OnlineSalesData.csv")
df
 | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method
---|---|---|---|---|---|---|---|---|---
0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |
3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 15.99 | 63.96 | North America | Credit Card |
4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 1 | 89.99 | 89.99 | Europe | PayPal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
235 | 10236 | 2024-08-23 | Home Appliances | Nespresso Vertuo Next Coffee and Espresso Maker | 1 | 159.99 | 159.99 | Europe | PayPal |
236 | 10237 | 2024-08-24 | Clothing | Nike Air Force 1 Sneakers | 3 | 90.00 | 270.00 | Asia | Debit Card |
237 | 10238 | 2024-08-25 | Books | The Handmaid's Tale by Margaret Atwood | 3 | 10.99 | 32.97 | North America | Credit Card |
238 | 10239 | 2024-08-26 | Beauty Products | Sunday Riley Luna Sleeping Night Oil | 1 | 55.00 | 55.00 | Europe | PayPal |
239 | 10240 | 2024-08-27 | Sports | Yeti Rambler 20 oz Tumbler | 2 | 29.99 | 59.98 | Asia | Credit Card |
240 rows × 9 columns
df.columns
Index(['Transaction ID', 'Date', 'Product Category', 'Product Name', 'Units Sold', 'Unit Price', 'Total Revenue', 'Region', 'Payment Method'], dtype='object')
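As an optional sanity check (a minimal sketch, assuming the CSV loaded above), one can verify that the documented fields match the file before relying on the column names below:
expected = ['Transaction ID', 'Date', 'Product Category', 'Product Name',
            'Units Sold', 'Unit Price', 'Total Revenue', 'Region', 'Payment Method']
# Fail fast if the schema drifts from the description at the top
assert list(df.columns) == expected, f'Unexpected columns: {list(df.columns)}'
print(df.dtypes)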
def getDataFrameSize(df: pd.DataFrame) -> list[int]:
    # Return [row count, column count] from df.shape
    (rows, cols) = df.shape
    return [rows, cols]
getDataFrameSize(df)
[240, 9]
Find the transactions in the product category 'Electronics'.
Display the last 6 rows of the subset ['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Date'],
sorted by 'Date' in ascending order.
Then check 'Total Revenue' against 'Unit Price' × 'Units Sold' and count the number of mistakes, if any.
Report the count of each result label ('OK', 'Mistake') in a column named 'Result Label Count'.
def selectData(df: pd.DataFrame, colName: str, product: str, colList: list) -> pd.DataFrame:
    # Filter rows where colName equals product, keep colList, sort by Date, return the last 6
    return df[df[colName] == product][colList].sort_values(by=['Date']).tail(6)
colList = ['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Date']
electronicsSales = selectData(df, 'Product Category', 'Electronics', colList)
electronicsSales
 | Product Category | Product Name | Unit Price | Units Sold | Date
---|---|---|---|---|---
204 | Electronics | Samsung Galaxy Watch 4 | 299.99 | 1 | 2024-07-23 |
210 | Electronics | Canon EOS Rebel T7i DSLR Camera | 749.99 | 1 | 2024-07-29 |
216 | Electronics | Apple MacBook Pro 16-inch | 2399.00 | 1 | 2024-08-04 |
222 | Electronics | Amazon Echo Dot (4th Gen) | 49.99 | 4 | 2024-08-10 |
228 | Electronics | Google Nest Wifi Router | 169.00 | 1 | 2024-08-16 |
234 | Electronics | Bose QuietComfort 35 II Wireless Headphones | 299.00 | 1 | 2024-08-22 |
def checkTotalRevenue(B: pd.DataFrame, colName: list) -> pd.DataFrame:
    A = B.copy()
    # Label each row by whether Unit Price * Units Sold matches Total Revenue
    A['Revenue Check'] = (A['Unit Price'] * A['Units Sold'] - A['Total Revenue']).map(insertRevenue_label)
    return A[colName].groupby('Revenue Check').size().reset_index(name='Result Label Count')
def insertRevenue_label(delta: float) -> str:
    # Compare the absolute difference so overstated revenue is flagged too
    if abs(delta) < 0.01:
        return 'OK'
    else:
        return 'Mistake'
colName = ['Revenue Check']
[TotalSalesQty, col] = getDataFrameSize(df)
print("Total Sales Line Items: ", TotalSalesQty)
checkTotalRevenue(df, colName)
Total Sales Line Items: 240
 | Revenue Check | Result Label Count
---|---|---
0 | OK | 240 |
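An equivalent vectorized check, shown only as a sketch (it uses numpy's isclose rather than the original map-based labeling):
import numpy as np
# Same tolerance of one cent, applied to whole columns at once
labels = np.where(np.isclose(df['Unit Price'] * df['Units Sold'],
                             df['Total Revenue'], atol=0.01), 'OK', 'Mistake')
print(pd.Series(labels).value_counts())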
Check the transactions and drop any duplicates.
def dropDuplicateTransactions(df: pd.DataFrame) -> pd.DataFrame:
    # Keep the first occurrence of each Transaction ID
    return df.drop_duplicates(keep='first', subset=['Transaction ID'])
[TransactionQtyBefore, col] = getDataFrameSize(df)
[TransactionQtyAfter, col] = getDataFrameSize(dropDuplicateTransactions(df))
print("Transaction quantities in 'dropDuplicateTransactions'\n",
      "[before: ", TransactionQtyBefore, "] versus [after: ", TransactionQtyAfter, "]")
Transaction quantities in 'dropDuplicateTransactions' [before: 240 ] versus [after: 240 ]
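To preview how many rows would be dropped without modifying anything, duplicated() can count repeats directly (a sketch):
# 0 here, which matches the unchanged before/after counts above
print(df.duplicated(subset=['Transaction ID']).sum())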
def dropMissingData(df: pd.DataFrame) -> pd.DataFrame:
    conditions = [
        df['Units Sold'].isna(),
        df['Total Revenue'].isna()
    ]
    # Combine the conditions with logical AND: drop rows missing both values
    final_mask = conditions[0] & conditions[1]
    return df[~final_mask]
[TransactionQtyBefore, col] = getDataFrameSize(df)
[TransactionQtyAfter, col] = getDataFrameSize(dropMissingData(df))
print("Transaction quantities in 'dropMissingData'\n",
      "[before: ", TransactionQtyBefore, "] versus [after: ", TransactionQtyAfter, "]")
Transaction quantities in 'dropMissingData' [before: 240 ] versus [after: 240 ]
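A quick per-column overview of missing values (a sketch) confirms why the row count is unchanged:
# Count NaN values in the two columns the mask inspects
print(df[['Units Sold', 'Total Revenue']].isna().sum())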
List each product category and its revenue by 'Date' in the range from '2024-07-15' to '2024-08-15'.
Pivot the DataFrame by spreading the values of a specific column across multiple columns, using values from another column as the new column headers.
def pivotTable(df: pd.DataFrame, start_date: str, end_date: str) -> pd.DataFrame:
    # One column per Product Category, one row per Date, values = Total Revenue
    productDF = df.pivot(index='Date', columns='Product Category', values='Total Revenue')
    return productDF[start_date:end_date]
BacktoSchoolDF = pivotTable(df, '2024-07-15', '2024-08-15')
BacktoSchoolDF
Date | Beauty Products | Books | Clothing | Electronics | Home Appliances | Sports
---|---|---|---|---|---|---
2024-07-15 | 125.0 | NaN | NaN | NaN | NaN | NaN |
2024-07-16 | NaN | NaN | NaN | NaN | NaN | 449.99 |
2024-07-17 | NaN | NaN | NaN | 358.00 | NaN | NaN |
2024-07-18 | NaN | NaN | NaN | NaN | 99.95 | NaN |
2024-07-19 | NaN | NaN | 179.97 | NaN | NaN | NaN |
2024-07-20 | NaN | 29.98 | NaN | NaN | NaN | NaN |
2024-07-21 | 52.0 | NaN | NaN | NaN | NaN | NaN |
2024-07-22 | NaN | NaN | NaN | NaN | NaN | 399.99 |
2024-07-23 | NaN | NaN | NaN | 299.99 | NaN | NaN |
2024-07-24 | NaN | NaN | NaN | NaN | 379.99 | NaN |
2024-07-25 | NaN | NaN | 196.00 | NaN | NaN | NaN |
2024-07-26 | NaN | 50.97 | NaN | NaN | NaN | NaN |
2024-07-27 | 79.0 | NaN | NaN | NaN | NaN | NaN |
2024-07-28 | NaN | NaN | NaN | NaN | NaN | 129.00 |
2024-07-29 | NaN | NaN | NaN | 749.99 | NaN | NaN |
2024-07-30 | NaN | NaN | NaN | NaN | 339.98 | NaN |
2024-07-31 | NaN | NaN | 39.60 | NaN | NaN | NaN |
2024-08-01 | NaN | 32.97 | NaN | NaN | NaN | NaN |
2024-08-02 | 58.0 | NaN | NaN | NaN | NaN | NaN |
2024-08-03 | NaN | NaN | NaN | NaN | NaN | 349.99 |
2024-08-04 | NaN | NaN | NaN | 2399.00 | NaN | NaN |
2024-08-05 | NaN | NaN | NaN | NaN | 449.99 | NaN |
2024-08-06 | NaN | NaN | 149.97 | NaN | NaN | NaN |
2024-08-07 | NaN | 25.98 | NaN | NaN | NaN | NaN |
2024-08-08 | 27.0 | NaN | NaN | NaN | NaN | NaN |
2024-08-09 | NaN | NaN | NaN | NaN | NaN | 599.99 |
2024-08-10 | NaN | NaN | NaN | 199.96 | NaN | NaN |
2024-08-11 | NaN | NaN | NaN | NaN | 459.98 | NaN |
2024-08-12 | NaN | NaN | 89.98 | NaN | NaN | NaN |
2024-08-13 | NaN | 80.97 | NaN | NaN | NaN | NaN |
2024-08-14 | 6.7 | NaN | NaN | NaN | NaN | NaN |
2024-08-15 | NaN | NaN | NaN | NaN | NaN | 299.90 |
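Note that df.pivot raises a ValueError if any (Date, Product Category) pair occurs more than once. A more forgiving variant, sketched here under the assumption that summing duplicates would be acceptable, is pivot_table:
# pivot_table aggregates duplicate index/column pairs instead of failing
productDF = df.pivot_table(index='Date', columns='Product Category',
                           values='Total Revenue', aggfunc='sum')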
For each product category, list its revenue on each date.
Use the DataFrame 'BacktoSchoolDF'. The output should return to the format of the original df.
Melt (unpivot) the DataFrame by gathering multiple columns into two new columns:
one for the variable names (the former column headers) and another for their corresponding values.
def meltTable(df: pd.DataFrame) -> pd.DataFrame:
    # reset_index makes 'Date' a regular column instead of the index;
    # rebinding (rather than inplace=True) leaves the caller's DataFrame untouched
    df = df.reset_index()
    meltProduct = df.melt(id_vars='Date', var_name='Product Category', value_name='Total Revenue')
    return meltProduct.dropna(subset=['Total Revenue']).sort_values(by=['Date']).reset_index(drop=True)
BacktoSchoolProducts = meltTable(BacktoSchoolDF)
BacktoSchoolProducts
 | Date | Product Category | Total Revenue
---|---|---|---
0 | 2024-07-15 | Beauty Products | 125.00 |
1 | 2024-07-16 | Sports | 449.99 |
2 | 2024-07-17 | Electronics | 358.00 |
3 | 2024-07-18 | Home Appliances | 99.95 |
4 | 2024-07-19 | Clothing | 179.97 |
5 | 2024-07-20 | Books | 29.98 |
6 | 2024-07-21 | Beauty Products | 52.00 |
7 | 2024-07-22 | Sports | 399.99 |
8 | 2024-07-23 | Electronics | 299.99 |
9 | 2024-07-24 | Home Appliances | 379.99 |
10 | 2024-07-25 | Clothing | 196.00 |
11 | 2024-07-26 | Books | 50.97 |
12 | 2024-07-27 | Beauty Products | 79.00 |
13 | 2024-07-28 | Sports | 129.00 |
14 | 2024-07-29 | Electronics | 749.99 |
15 | 2024-07-30 | Home Appliances | 339.98 |
16 | 2024-07-31 | Clothing | 39.60 |
17 | 2024-08-01 | Books | 32.97 |
18 | 2024-08-02 | Beauty Products | 58.00 |
19 | 2024-08-03 | Sports | 349.99 |
20 | 2024-08-04 | Electronics | 2399.00 |
21 | 2024-08-05 | Home Appliances | 449.99 |
22 | 2024-08-06 | Clothing | 149.97 |
23 | 2024-08-07 | Books | 25.98 |
24 | 2024-08-08 | Beauty Products | 27.00 |
25 | 2024-08-09 | Sports | 599.99 |
26 | 2024-08-10 | Electronics | 199.96 |
27 | 2024-08-11 | Home Appliances | 459.98 |
28 | 2024-08-12 | Clothing | 89.98 |
29 | 2024-08-13 | Books | 80.97 |
30 | 2024-08-14 | Beauty Products | 6.70 |
31 | 2024-08-15 | Sports | 299.90 |
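Since each date in this range has exactly one transaction, the melt should recover one row per original transaction. A minimal spot-check (a sketch; 'Date' is stored as ISO strings, so between() compares correctly):
original = df[df['Date'].between('2024-07-15', '2024-08-15')]
# Same number of rows before pivoting and after melting back
assert len(BacktoSchoolProducts) == len(original)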
Cut 'Unit Price' into three groups based on the input thresholds 'Low Price', 'Medium Price', and 'High Price'. Display the sum of 'Units Sold' in each group.
def aggregateUnitPrice(B: pd.DataFrame, lowPrice: float, medPrice: float, highPrice: float) -> pd.DataFrame:
    A = B.copy()
    # Define the bin edges; prices above highPrice fall outside every bin
    bins = [0, lowPrice, medPrice, highPrice]
    labels = ['Low Price (<=$' + str(lowPrice) + ')',
              'Medium Price (<=$' + str(medPrice) + ')',
              'High Price (<=$' + str(highPrice) + ')']
    # Create a new column with the group labels
    A['Price Group'] = pd.cut(A['Unit Price'], bins=bins, labels=labels)
    # Aggregate Units Sold within each price group
    grouped = A.groupby('Price Group')[['Units Sold']].sum()
    return grouped
aggregateUnitPrice(df, 1750.00, 3500.00, 5000.00)
Price Group | Units Sold
---|---
Low Price (<=$1750.0) | 514
Medium Price (<=$3500.0) | 3
High Price (<=$5000.0) | 1
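pd.cut uses right-closed intervals by default, so a price exactly equal to lowPrice falls in the low bin, while a price of exactly 0 (or above highPrice) falls outside every bin and is silently excluded. A sketch of a variant that at least keeps a zero price:
A = df.copy()
# include_lowest=True makes the first interval closed on both ends
A['Price Group'] = pd.cut(A['Unit Price'], bins=[0, 1750.0, 3500.0, 5000.0],
                          labels=['Low', 'Medium', 'High'], include_lowest=True)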
Filter the transactions for a given 'Region' and 'Payment Method', then group the results by 'Product Category' and sum 'Units Sold'.
def region_payment(df: pd.DataFrame, region: str, payment: str) -> pd.DataFrame:
    # Keep only the rows matching the requested region and payment method
    regionSales = df.loc[(df['Region'] == region) & (df['Payment Method'] == payment),
                         ['Product Category', 'Units Sold', 'Region', 'Payment Method']]
    # Sum Units Sold per category within that region/payment slice
    result = regionSales.groupby(['Product Category', 'Region', 'Payment Method'])[['Units Sold']].sum()
    return result
region_payment(df, 'Europe', 'PayPal')
Product Category | Region | Payment Method | Units Sold
---|---|---|---
Beauty Products | Europe | PayPal | 46 |
Home Appliances | Europe | PayPal | 59 |
Find the product names that have more than a specified number of characters.
def invalid_productnames(df: pd.DataFrame, charlen: int) -> pd.DataFrame:
    # Compare each product name's string length against the threshold
    return df.loc[df['Product Name'].str.len() > charlen, ['Transaction ID', 'Product Name']]
invalid_productnames(df, 50)
 | Transaction ID | Product Name
---|---|---
111 | 10112 | Sapiens: A Brief History of Humankind by Yuval... |
142 | 10143 | Clinique Dramatically Different Moisturizing L... |
172 | 10173 | Paula's Choice Skin Perfecting 2% BHA Liquid E... |
202 | 10203 | Dr. Jart+ Cicapair Tiger Grass Color Correctin... |
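The names above are truncated by pandas' default display width, not in the data itself. To print them in full (a sketch):
# Temporarily lift the column-width limit for this one display
with pd.option_context('display.max_colwidth', None):
    print(invalid_productnames(df, 50))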
Display the top six revenues in Europe, converted to euros.
def checkEuropeRegion(region: str, revenue: float, exchangeRate: float):
    # Convert only European revenue; other regions get a placeholder
    if region == 'Europe':
        return revenue * exchangeRate
    else:
        return 'N/A'
def exchangeCurrency(B: pd.DataFrame) -> pd.DataFrame:
    A = B.copy()
    A['Total Revenue (Euro)'] = A.apply(lambda x: checkEuropeRegion(region=x['Region'], revenue=x['Total Revenue'], exchangeRate=0.97), axis=1)
    result = A[['Region', 'Total Revenue', 'Total Revenue (Euro)']]
    return result[result['Region'] == 'Europe'].sort_values('Total Revenue (Euro)', ascending=False).head(6)
exchangeCurrency(df)
 | Region | Total Revenue | Total Revenue (Euro)
---|---|---|---
85 | Europe | 2599.98 | 2521.9806 |
37 | Europe | 1599.98 | 1551.9806 |
7 | Europe | 1199.98 | 1163.9806 |
73 | Europe | 899.99 | 872.9903 |
127 | Europe | 799.98 | 775.9806 |
103 | Europe | 699.98 | 678.9806 |
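The row-wise apply works, but on larger frames a vectorized version is faster. A sketch that keeps the same assumed 0.97 exchange rate:
# Filter to Europe first, then convert the whole column at once
euro = df.loc[df['Region'] == 'Europe', ['Region', 'Total Revenue']].copy()
euro['Total Revenue (Euro)'] = euro['Total Revenue'] * 0.97
print(euro.nlargest(6, 'Total Revenue (Euro)'))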
For each product category, find the transaction with the highest 'Total Revenue'.
def maxRevenue_ProductCategory(df: pd.DataFrame) -> pd.DataFrame:
    # Within each category, keep the row(s) with the maximum Total Revenue
    highest_revenue = df.groupby('Product Category').apply(lambda x: x[x['Total Revenue'] == x['Total Revenue'].max()])
    return highest_revenue[['Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Total Revenue', 'Region']].sort_values('Total Revenue', ascending=False)
maxRevenue_ProductCategory(df)
Product Category | | Product Category | Product Name | Unit Price | Units Sold | Total Revenue | Region
---|---|---|---|---|---|---|---
Electronics | 102 | Electronics | Canon EOS R5 Camera | 3899.99 | 1 | 3899.99 | North America |
Home Appliances | 85 | Home Appliances | LG OLED TV | 1299.99 | 2 | 2599.98 | Europe |
Sports | 35 | Sports | Peloton Bike | 1895.00 | 1 | 1895.00 | Asia |
Clothing | 8 | Clothing | Nike Air Force 1 | 89.99 | 6 | 539.94 | Asia |
Beauty Products | 16 | Beauty Products | Dyson Supersonic Hair Dryer | 399.99 | 1 | 399.99 | Europe |
Books | 45 | Books | Becoming by Michelle Obama | 32.50 | 4 | 130.00 | North America |
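groupby().apply() with a boolean mask returns the MultiIndex seen above and, in recent pandas versions, emits a deprecation warning about operating on the grouping column. When each category's maximum is unique, idxmax gives a flatter equivalent (a sketch):
# idxmax returns the original row label of each per-category maximum
top_rows = df.loc[df.groupby('Product Category')['Total Revenue'].idxmax()]
print(top_rows[['Product Category', 'Product Name', 'Total Revenue']].sort_values('Total Revenue', ascending=False))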
Allow the inputs of 'Payment Method' and 'Product Category'.
Display the top 10 results with the columns ['Payment Method', 'Total Revenue', 'Revenue Rank', 'Product Category', 'Region'],
sorted in ascending order of 'Revenue Rank'.
def rankRevenue_PaymentMethod(B: pd.DataFrame, paymentMethod: str, productCat: str) -> pd.DataFrame:
    A = B.copy()
    # Dense rank of Total Revenue within each (Payment Method, Product Category) group
    A['Revenue Rank'] = A.groupby(['Payment Method', 'Product Category'])['Total Revenue'].rank(method='dense', ascending=False)
    result = A[['Payment Method', 'Total Revenue', 'Revenue Rank', 'Product Category', 'Region']].sort_values(by=['Payment Method', 'Revenue Rank'], ascending=True)
    return result[(result['Payment Method'] == paymentMethod) & (result['Product Category'] == productCat)].head(10)
rankRevenue_PaymentMethod(df, 'PayPal', 'Home Appliances')
 | Payment Method | Total Revenue | Revenue Rank | Product Category | Region
---|---|---|---|---|---
85 | PayPal | 2599.98 | 1.0 | Home Appliances | Europe |
37 | PayPal | 1599.98 | 2.0 | Home Appliances | Europe |
7 | PayPal | 1199.98 | 3.0 | Home Appliances | Europe |
73 | PayPal | 899.99 | 4.0 | Home Appliances | Europe |
127 | PayPal | 799.98 | 5.0 | Home Appliances | Europe |
103 | PayPal | 699.98 | 6.0 | Home Appliances | Europe |
67 | PayPal | 659.97 | 7.0 | Home Appliances | Europe |
115 | PayPal | 599.90 | 8.0 | Home Appliances | Europe |
229 | PayPal | 599.00 | 9.0 | Home Appliances | Europe |
1 | PayPal | 499.99 | 10.0 | Home Appliances | Europe |
Sort 'Units Sold' in descending order.
Allow the user to specify how many of the most-sold rows per product category to display.
The columns to be displayed are ['Product Category (Top qty groups)', 'Units Sold', 'Region', 'Payment Method'].
def topProductSold(df: pd.DataFrame, qty: int) -> pd.DataFrame:
    # Sort by Units Sold (descending) and keep the top qty rows of each category
    top = df.sort_values(['Units Sold', 'Product Category'], ascending=False).groupby('Product Category').head(qty).reset_index(drop=True)
    result = top[['Product Category', 'Units Sold', 'Region', 'Payment Method']]
    result = result.rename(columns={'Product Category': 'Product Category (Top ' + str(qty) + ' groups)'})
    return result
topProductSold(df, 2)
 | Product Category (Top 2 groups) | Units Sold | Region | Payment Method
---|---|---|---|---
0 | Clothing | 10 | Asia | Debit Card |
1 | Sports | 6 | Asia | Credit Card |
2 | Sports | 6 | Asia | Credit Card |
3 | Clothing | 6 | Asia | Debit Card |
4 | Electronics | 4 | North America | Credit Card |
5 | Electronics | 4 | North America | Credit Card |
6 | Books | 4 | North America | Credit Card |
7 | Books | 4 | North America | Credit Card |
8 | Home Appliances | 3 | Europe | PayPal |
9 | Home Appliances | 3 | Europe | PayPal |
10 | Beauty Products | 2 | Europe | PayPal |
11 | Beauty Products | 2 | Europe | PayPal |
Count the number of unique product categories sold in each region.
def count_unique_categories(df: pd.DataFrame) -> pd.DataFrame:
    # nunique counts distinct Product Category values per region
    result = df.groupby(['Region'])['Product Category'].nunique().reset_index()
    result = result.rename(columns={'Product Category': 'Unique Category Count'})
    return result
count_unique_categories(df)
 | Region | Unique Category Count
---|---|---
0 | Asia | 2 |
1 | Europe | 2 |
2 | North America | 2 |
Keep only the product categories whose average 'Total Revenue' exceeds a given threshold.
List the result for one specific region, displaying the columns ['Region', 'Product Category', 'Product Name', 'Total Revenue'].
def filteredAveRevenueRegion(df: pd.DataFrame, region: str, average: float) -> pd.DataFrame:
    # filter() keeps whole categories whose mean Total Revenue exceeds the threshold
    result = df.groupby('Product Category').filter(lambda x: x['Total Revenue'].mean() > average) \
               .sort_values(by='Region')[['Region', 'Product Category', 'Product Name', 'Total Revenue']]
    return result[result['Region'] == region]
filteredAveRevenueRegion(df, 'Asia', 210.00)
 | Region | Product Category | Product Name | Total Revenue
---|---|---|---|---
119 | Asia | Sports | YETI Hopper Flip Portable Cooler | 249.99 |
77 | Asia | Sports | Rogue Fitness Kettlebell | 209.97 |
83 | Asia | Sports | Spalding NBA Street Basketball | 149.94 |
89 | Asia | Sports | On Running Cloud Shoes | 259.98 |
95 | Asia | Sports | Garmin Fenix 6X Pro | 999.99 |
101 | Asia | Sports | Under Armour HOVR Sonic 4 Shoes | 219.98 |
107 | Asia | Sports | Bowflex SelectTech 552 Dumbbells | 399.99 |
113 | Asia | Sports | Fitbit Versa 3 | 689.85 |
125 | Asia | Sports | Yeti Roadie 24 Cooler | 199.99 |
131 | Asia | Sports | Hydro Flask Wide Mouth Water Bottle | 159.80 |
137 | Asia | Sports | Fitbit Inspire 2 | 199.90 |
143 | Asia | Sports | YETI Tundra 45 Cooler | 299.99 |
149 | Asia | Sports | Bose SoundSport Wireless Earbuds | 299.98 |
71 | Asia | Sports | Nike Metcon 6 | 389.97 |
155 | Asia | Sports | Garmin Forerunner 245 | 299.99 |
167 | Asia | Sports | Yeti Rambler Bottle | 149.97 |
173 | Asia | Sports | Bowflex SelectTech 1090 Adjustable Dumbbells | 699.99 |
179 | Asia | Sports | Oakley Holbrook Sunglasses | 146.00 |
185 | Asia | Sports | Polar Vantage V2 | 499.95 |
191 | Asia | Sports | TRX All-in-One Suspension Training System | 169.95 |
197 | Asia | Sports | GoPro HERO9 Black | 449.99 |
203 | Asia | Sports | Yeti Tundra Haul Portable Wheeled Cooler | 399.99 |
209 | Asia | Sports | Bose SoundLink Color Bluetooth Speaker II | 129.00 |
215 | Asia | Sports | YETI Tundra 65 Cooler | 349.99 |
221 | Asia | Sports | Garmin Forerunner 945 | 599.99 |
227 | Asia | Sports | Fitbit Luxe | 299.90 |
233 | Asia | Sports | Hydro Flask Standard Mouth Water Bottle | 98.85 |
161 | Asia | Sports | TriggerPoint GRID Foam Roller | 69.98 |
65 | Asia | Sports | Garmin Edge 530 | 599.98 |
239 | Asia | Sports | Yeti Rambler 20 oz Tumbler | 59.98 |
35 | Asia | Sports | Peloton Bike | 1895.00 |
17 | Asia | Sports | Manduka PRO Yoga Mat | 479.96 |
11 | Asia | Sports | Babolat Pure Drive Tennis Racket | 599.97 |
23 | Asia | Sports | Adidas FIFA World Cup Football | 89.97 |
47 | Asia | Sports | Yeti Rambler Tumbler | 239.94 |
5 | Asia | Sports | Wilson Evolution Basketball | 149.95 |
53 | Asia | Sports | Titleist Pro V1 Golf Balls | 249.95 |
41 | Asia | Sports | Fitbit Charge 5 | 259.98 |
59 | Asia | Sports | Hyperice Hypervolt Massager | 349.00 |
29 | Asia | Sports | Nike Air Zoom Pegasus 37 | 259.98 |
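groupby().filter() keeps or drops whole categories, so every Sports row survives even when its individual revenue is below 210. To see which categories pass the threshold (a sketch):
# Mean Total Revenue per category; the filter keeps those above 210
means = df.groupby('Product Category')['Total Revenue'].mean()
print(means[means > 210.00])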
Find the most expensive product (maximum 'Unit Price') in each product category.
def mostExpensiveProduct(df: pd.DataFrame) -> pd.DataFrame:
    result = df.groupby(['Product Category'])[['Unit Price']].max()
    result.rename(columns={'Unit Price': 'Max Unit Price'}, inplace=True)
    return result
mostExpensiveProduct(df)
Product Category | Max Unit Price
---|---
Beauty Products | 399.99 |
Books | 35.99 |
Clothing | 249.99 |
Electronics | 3899.99 |
Home Appliances | 1299.99 |
Sports | 1895.00 |
For each product category, aggregate the total revenue and collect the sorted list of sale dates.
def aggProductbyRevenueDates(df: pd.DataFrame) -> pd.DataFrame:
    # Named aggregation: sum the revenue and join the unique dates into one string
    return df.groupby('Product Category').agg(
        Revenue=('Total Revenue', 'sum'),
        Dates=('Date', lambda x: ', '.join(sorted(x.unique())))
    ).reset_index()
aggProductbyRevenueDates(df)
 | Product Category | Revenue | Dates
---|---|---|---
0 | Beauty Products | 2621.90 | 2024-01-05, 2024-01-11, 2024-01-17, 2024... |
1 | Books | 1861.93 | 2024-01-04, 2024-01-10, 2024-01-16, 2024... |
2 | Clothing | 8128.93 | 2024-01-03, 2024-01-09, 2024-01-15, 2024... |
3 | Electronics | 34982.41 | 2024-01-01, 2024-01-07, 2024-01-13, 2024... |
4 | Home Appliances | 18646.16 | 2024-01-02, 2024-01-08, 2024-01-14, 2024... |
5 | Sports | 14326.52 | 2024-01-06, 2024-01-12, 2024-01-18, 2024... |
Split the data into a product catalog and a sales table, then merge them back on 'Product Name'.
ProductCatalog = df[['Product Category', 'Product Name', 'Unit Price']]
ProductSales = df[['Transaction ID', 'Date', 'Product Name', 'Units Sold', 'Region', 'Payment Method']]
def mergeTwoTables(dfa: pd.DataFrame, dfb: pd.DataFrame) -> pd.DataFrame:
    colName = ['Transaction ID', 'Date', 'Product Category', 'Product Name', 'Unit Price', 'Units Sold', 'Region', 'Payment Method']
    # Inner join on Product Name; duplicate matches are pruned by Transaction ID
    result = pd.merge(dfa, dfb, on='Product Name', how='inner')[colName].drop_duplicates(keep='first', subset='Transaction ID')
    result['Total Revenue'] = result['Unit Price'] * result['Units Sold']
    return result
print('Shape of First Table: ', ProductCatalog.shape)
print('Shape of Second Table: ', ProductSales.shape)
finalTable = mergeTwoTables(ProductSales, ProductCatalog)
print(finalTable.shape)
finalTable.head(6)
Shape of First Table: (240, 3)
Shape of Second Table: (240, 6)
(240, 9)
 | Transaction ID | Date | Product Category | Product Name | Unit Price | Units Sold | Region | Payment Method | Total Revenue
---|---|---|---|---|---|---|---|---|---
0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 999.99 | 2 | North America | Credit Card | 1999.98 |
1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 499.99 | 1 | Europe | PayPal | 499.99 |
2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 69.99 | 3 | Asia | Debit Card | 209.97 |
3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 15.99 | 4 | North America | Credit Card | 63.96 |
4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 89.99 | 1 | Europe | PayPal | 89.99 |
5 | 10006 | 2024-01-06 | Sports | Wilson Evolution Basketball | 29.99 | 5 | Asia | Credit Card | 149.95 |
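Because 'Product Name' is not guaranteed unique across transactions, the inner merge can multiply rows before drop_duplicates prunes them back to one per 'Transaction ID'. pandas can surface this directly (a sketch using the indicator flag):
# indicator adds a _merge column showing where each row originated
merged = pd.merge(ProductSales, ProductCatalog, on='Product Name', how='inner', indicator=True)
print(merged.shape, merged['_merge'].value_counts().to_dict())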
Exclude from the first DataFrame all rows whose value in a given column also appears in the second DataFrame.
def ExcludeItems(dfA: pd.DataFrame, dfB: pd.DataFrame, colName: str) -> pd.DataFrame:
    # reset_index ensures colName is a real column even when dfB keeps it as its index
    excluded = dfB.reset_index()[colName]
    return dfA[~dfA[colName].isin(excluded)].reset_index()
ExcludeItems(df, BacktoSchoolDF, 'Date')
 | index | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method
---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
1 | 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
2 | 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |
3 | 3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 15.99 | 63.96 | North America | Credit Card |
4 | 4 | 10005 | 2024-01-05 | Beauty Products | Neutrogena Skincare Set | 1 | 89.99 | 89.99 | Europe | PayPal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
203 | 235 | 10236 | 2024-08-23 | Home Appliances | Nespresso Vertuo Next Coffee and Espresso Maker | 1 | 159.99 | 159.99 | Europe | PayPal |
204 | 236 | 10237 | 2024-08-24 | Clothing | Nike Air Force 1 Sneakers | 3 | 90.00 | 270.00 | Asia | Debit Card |
205 | 237 | 10238 | 2024-08-25 | Books | The Handmaid's Tale by Margaret Atwood | 3 | 10.99 | 32.97 | North America | Credit Card |
206 | 238 | 10239 | 2024-08-26 | Beauty Products | Sunday Riley Luna Sleeping Night Oil | 1 | 55.00 | 55.00 | Europe | PayPal |
207 | 239 | 10240 | 2024-08-27 | Sports | Yeti Rambler 20 oz Tumbler | 2 | 29.99 | 59.98 | Asia | Credit Card |
208 rows × 10 columns
print(df.columns)
print(df.shape)
Index(['Transaction ID', 'Date', 'Product Category', 'Product Name', 'Units Sold', 'Unit Price', 'Total Revenue', 'Region', 'Payment Method'], dtype='object')
(240, 9)