Andrew Wang | Portfolio - Data Mart Project

DataMart

1. Setup

We begin by setting up the title, author, and output of your programming project. This code shows that this chunk will not be included in final HTML document, and the libraries or packages that you will be using have been loaded in.

--- title: "Walmart Purchase Analysis" author: "Andrew Wang, Isabelle Bolen, Ahmed Elsayed, Alex Hensgen" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` #Load in Libraries ```{r} library(tidyverse)r library(caret) library(ggthemes) library(leaps) library(doParallel) ```

2. Getting the Data Ready

In this section we load the data into the dataframe. Then took a quick look at the data structure, and a summary of statistics. After checking for missing values, we convert the catergorical variables into factors for modeling, selecting only revelent columns for analysis.

## Load in Walmart Data ```{r} Walmart <- read_csv("Data\\walmart.csv") ``` ## Examine Walmart Data ```{r} glimpse(Walmart) summary(Walmart) ``` ## Check for Missing Data ```{r} colSums(is.na(Walmart)) ``` ## Convert DataTypes to Factors ```{r} Walmart <- Walmart |> mutate(Gender = as.factor(Gender), Age = as.factor(Age), Occupation = as.factor(Occupation), City_Category = as.factor(City_Category), Stay_In_Current_City_Years = as.factor(Stay_In_Current_City_Years), Marital_Status = as.factor(Marital_Status), Product_Category = as.factor(Product_Category)) |> select(Purchase,Age,Gender,City_Category,Stay_In_Current_City_Years,Occupation,Marital_Status,Product_Category) ```

3. Exploratory Data Analysis

The first visual shows the total purchase amounts by product category and gender. The second shows spending patterns by age and gender.

## Exploratory Visual 1 - Distribution of Purchase Amounts ```{r} ## Aggregate Totals by Product Category and Gender walmart_purchase_category_gender <- Walmart |> group_by(Product_Category,Gender) |> summarize(Total_Purchase = sum(Purchase,na.rm=TRUE)) |> arrange(desc(Total_Purchase)) ## Generate Visualization walmart_purchase_category_gender |> ggplot(aes(x=Product_Category,y=Total_Purchase))+ geom_col(aes(fill=Gender))+ theme_minimal()+ scale_fill_brewer(palette="Dark2")+ labs(title = "Total Purchase by Product Category and Gender", x = "Product Category", y = "Total Purchase Amount") ``` ## Exploratory Visual 2 - Purchase Amounts by Age Group and Gender ```{r} walmart_purchase_age_groups <- Walmart |> group_by(Age,Gender) |> summarize(Total_purchase = sum(Purchase)) ## Generate Visualization walmart_purchase_age_groups |> ggplot(aes(Age, Total_purchase))+ geom_col(aes(fill=Gender))+ scale_fill_brewer(palette="Dark2")+ theme_minimal()+ labs(title="Total Purchase Amount by Age Group and Gender", x= "Age Group", y= "Total Purchase Amount") ```

4. Feature Selection

We used leaps to find optimal predictors, this uses a search to identify which predictors or variables best fit the R² or the proportion of variance in the dependent variable.

## Use Leaps Package to Determine Optimal Predictors for Linear Regression ```{r} best_subsets <- regsubsets(Purchase~.,data=Walmart,method="exhaustive",really.big=T) lm_summary <- summary(best_subsets) lm_index <- which.max(lm_summary$adjr2) best_model_adjr2 <- lm_summary$adjr2[lm_index] best_model_adjr2 ``` ## Divide Training/Testing Set ```{r} set.seed(1) train.index <- createDataPartition(Walmart$Purchase,p=0.8,list=FALSE) train.df <- Walmart[train.index,] valid.df <- Walmart[-train.index,] rm(train.index) ```

5. Model 1: Linear Regression

In Model 1 we split the data into 80 training and 20 validation sets. Then we train a linear regression model with 10-fold cross-validation. Doing this allows us to preform analysis that uncovers R², or accuracy, RMSE, or error magnitude, and MAE, or average error.

## Run Model 1 Linear Regression ```{r} linearRegressionModel1 <- lm(Purchase ~ ., data = train.df) #Cross Validation 10x crossValidationModel1 <- train(Purchase ~ ., data = train.df, method = "lm", trControl = trainControl(method = "cv", number = 10)) crossValidationModel1 ``` # Run on Validation Set ```{r} test_results <- predict(crossValidationModel1,newdata=valid.df) actuals <- valid.df$Purchase valid_r2 <- cor(test_results,actuals)^2 valid_r2 valid_rmse <- sqrt(mean((test_results - actuals)^2)) valid_rmse valid_mae <- mean(abs(test_results - actuals)) valid_mae ```

6. Model 2: Random Forest

Speeding up the model training by using multiple CPU cores, the purpose of this model to capture non-linear relationships. Additionally we evaluate the performance of the Random Forest model on the validation data.

# Model 2 ## Set up parallel processing ```{r} cl <- makePSOCKcluster(detectCores()-1) registerDoParallel(cl) ``` ## Random Forest ```{r} #CrossValidation setup control <- trainControl(method="cv",number =10) #Parameters for RandomForest Model tuneGrid <- expand.grid( mtry=27, splitrule="extratrees", min.node.size=5 ) #Training RF model set.seed(1) randomForestModel <- train( Purchase~., data=train.df, method="ranger", trControl=control, tuneGrid=tuneGrid, num.trees = 200 ) print(randomForestModel) stopCluster(cl) ``` ## Validate Random Forest on Valid Set ```{r} rf_test_results <- predict(randomForestModel,newdata = valid.df) rf_actuals <- valid.df$Purchase rf_valid_r2 <- cor(rf_test_results,rf_actuals)^2 rf_valid_r2 ```