Exploratory Data Analysis - Laptops market

The objective of this project is to analyze laptop prices based on their components, using exploratory data analysis (EDA) tools. Careful data cleaning and EDA help us understand the market context and build higher-quality models.

library(tidyverse)
library(DataExplorer)
library(stringr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(ggthemes)
laptop_raw <- read.csv("laptop_price.csv")
str(laptop_raw)
## 'data.frame':	1303 obs. of  13 variables:
##  $ laptop_ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Company         : chr  "Apple" "Apple" "HP" "Apple" ...
##  $ Product         : chr  "MacBook Pro" "Macbook Air" "250 G6" "MacBook Pro" ...
##  $ TypeName        : chr  "Ultrabook" "Ultrabook" "Notebook" "Ultrabook" ...
##  $ Inches          : num  13.3 13.3 15.6 15.4 13.3 15.6 15.4 13.3 14 14 ...
##  $ ScreenResolution: chr  "IPS Panel Retina Display 2560x1600" "1440x900" "Full HD 1920x1080" "IPS Panel Retina Display 2880x1800" ...
##  $ Cpu             : chr  "Intel Core i5 2.3GHz" "Intel Core i5 1.8GHz" "Intel Core i5 7200U 2.5GHz" "Intel Core i7 2.7GHz" ...
##  $ Ram             : chr  "8GB" "8GB" "8GB" "16GB" ...
##  $ Memory          : chr  "128GB SSD" "128GB Flash Storage" "256GB SSD" "512GB SSD" ...
##  $ Gpu             : chr  "Intel Iris Plus Graphics 640" "Intel HD Graphics 6000" "Intel HD Graphics 620" "AMD Radeon Pro 455" ...
##  $ OpSys           : chr  "macOS" "macOS" "No OS" "macOS" ...
##  $ Weight          : chr  "1.37kg" "1.34kg" "1.86kg" "1.83kg" ...
##  $ Price_euros     : num  1340 899 575 2537 1804 ...
head(laptop_raw)
##   laptop_ID Company     Product  TypeName Inches
## 1         1   Apple MacBook Pro Ultrabook   13.3
## 2         2   Apple Macbook Air Ultrabook   13.3
## 3         3      HP      250 G6  Notebook   15.6
## 4         4   Apple MacBook Pro Ultrabook   15.4
## 5         5   Apple MacBook Pro Ultrabook   13.3
## 6         6    Acer    Aspire 3  Notebook   15.6
##                     ScreenResolution                        Cpu  Ram
## 1 IPS Panel Retina Display 2560x1600       Intel Core i5 2.3GHz  8GB
## 2                           1440x900       Intel Core i5 1.8GHz  8GB
## 3                  Full HD 1920x1080 Intel Core i5 7200U 2.5GHz  8GB
## 4 IPS Panel Retina Display 2880x1800       Intel Core i7 2.7GHz 16GB
## 5 IPS Panel Retina Display 2560x1600       Intel Core i5 3.1GHz  8GB
## 6                           1366x768    AMD A9-Series 9420 3GHz  4GB
##                Memory                          Gpu      OpSys Weight
## 1           128GB SSD Intel Iris Plus Graphics 640      macOS 1.37kg
## 2 128GB Flash Storage       Intel HD Graphics 6000      macOS 1.34kg
## 3           256GB SSD        Intel HD Graphics 620      No OS 1.86kg
## 4           512GB SSD           AMD Radeon Pro 455      macOS 1.83kg
## 5           256GB SSD Intel Iris Plus Graphics 650      macOS 1.37kg
## 6           500GB HDD                AMD Radeon R5 Windows 10  2.1kg
##   Price_euros
## 1     1339.69
## 2      898.94
## 3      575.00
## 4     2537.45
## 5     1803.60
## 6      400.00

Our original file contains 1,303 records described by 13 variables. As the structure printed above shows, most of the variables are categorical (character columns). Our target variable in this project is “Price_euros”, renamed to “price_euros” during cleaning.

Data preprocessing

I applied some techniques to check the quality of the data and make the necessary transformations:

  • There are no missing values in our dataset.
  • I extracted the numeric part of variables such as ScreenResolution, Ram, Memory and Weight. Only Weight is converted to a numeric type; the extracted resolution, RAM and total storage are kept as categories.
  • I grouped the CPU and GPU variables, which have many distinct values, into their main families, and binned the screen size (Inches) into three ranges.
  • I checked the remaining data types.
plot_intro(laptop_raw)
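
The missing-value check from the first bullet can also be done by hand. The sketch below is a minimal example on a small made-up data frame (`toy` and its values are assumptions for illustration, not part of the laptop data). Besides NA, it counts empty strings, which read.csv does not mark as missing in character columns.

```r
# Toy data frame standing in for laptop_raw (values are made up for illustration)
toy <- data.frame(
  Company = c("Apple", "HP", ""),
  Ram     = c("8GB", NA, "4GB"),
  Price   = c(1339.69, 575, NA)
)

# Number of NA values per column
na_count <- colSums(is.na(toy))

# Number of empty strings per column (NA entries are ignored)
empty_count <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_count)
print(empty_count)
```

Running the same two counts on laptop_raw should return zeros everywhere if the dataset is as complete as plot_intro() suggests.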

laptop_cl <- laptop_raw %>%
  rename_with(tolower, everything()) %>%
  transmute(
    laptop_id,
    company,
    product,
    typename,
    inches = case_when(
      inches < 14 ~ "<14",
      inches >= 14 & inches <= 16 ~ "14-16",
      inches > 16 ~ ">16",
    ),
    screen_res = str_extract(screenresolution, "\\b\\d{3,4}x\\d{3,4}\\b"),
    cpu = case_when(
      str_detect(cpu, fixed("Intel Core i7")) ~ "Intel Core i7",
      str_detect(cpu, fixed("Intel Core i5")) ~ "Intel Core i5",
      str_detect(cpu, fixed("Intel Core i3")) ~ "Intel Core i3",
      str_detect(cpu, "Intel") ~ "Intel other",
      str_detect(cpu, fixed("AMD")) ~ "AMD",
      TRUE ~ "other"
    ),
    ram = str_extract(ram, "\\d+"),  # keep the number of GB as a string, e.g. "8GB" -> "8"
    gpu = case_when(
      str_detect(gpu, fixed("Intel")) ~ "Intel Graphics",
      str_detect(gpu, fixed("Nvidia GeForce")) ~ "Nvidia GeForce",
      str_detect(gpu, "Quadro|GTX") ~ "Nvidia Quadro/GTX",
      str_detect(gpu, fixed("ARM")) ~ "ARM",
      str_detect(gpu, fixed("AMD")) ~ "AMD",
      TRUE ~ "other"
    ),
    op_sys = opsys,
    weight = as.numeric(str_extract(weight, "\\d+\\.?\\d*")),
    price_euros
  )
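
To make the regular expressions in this step easier to follow, here is a small sketch that applies the same patterns to a few strings copied from the head() output above. The last part, which splits the resolution into numeric width and height, is an assumption of mine and is not done in the pipeline, where screen_res stays a character column.

```r
library(stringr)

# Sample strings copied from the head(laptop_raw) output
screen <- c("IPS Panel Retina Display 2560x1600", "1440x900", "Full HD 1920x1080")
weight <- c("1.37kg", "1.34kg", "2.1kg")

# Same patterns as in the cleaning step
res <- str_extract(screen, "\\b\\d{3,4}x\\d{3,4}\\b")
weight_kg <- as.numeric(str_extract(weight, "\\d+\\.?\\d*"))

# Assumption: split "WIDTHxHEIGHT" into two numeric columns (not in the original pipeline)
dims <- str_split_fixed(res, "x", 2)
width_px  <- as.numeric(dims[, 1])
height_px <- as.numeric(dims[, 2])

print(data.frame(res, width_px, height_px, weight_kg))
```

Numeric width and height would allow scatter plots against price, at the cost of losing the simple categorical grouping used in this analysis.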

memory_cl <- laptop_raw %>%
  # Split values like "256GB SSD + 1TB HDD" into two drives;
  # rows with a single drive get NA in the second column
  separate(Memory, c("memorySSD", "memoryHDD"), sep = "\\+", fill = "right") %>%
  mutate(
    # Despite the names, memorySSD is the first drive listed and memoryHDD the second
    memorySSD = as.numeric(str_extract(memorySSD, "\\d+")),
    memoryHDD = as.numeric(str_extract(memoryHDD, "\\d+"))
  ) %>%
  mutate(
    # A size of 1 or 2 is treated as terabytes and converted to gigabytes
    memorySSD = replace(memorySSD, memorySSD == 1, 1000),
    memorySSD = replace(memorySSD, memorySSD == 2, 2000),
    memoryHDD = replace(memoryHDD, memoryHDD == 1, 1000),
    memoryHDD = replace(memoryHDD, memoryHDD == 2, 2000),
    memoryHDD = replace_na(memoryHDD, 0),
    totalmemory = memorySSD + memoryHDD
  ) %>%
  dplyr::select(laptop_ID, totalmemory)
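
The step above converts terabytes to gigabytes by looking at the value (1 or 2). A possible alternative, shown here only as a sketch, is to read the unit from the string itself. The helper `to_gb` and the sample strings are hypothetical; they are not part of the original code or copied from the dataset.

```r
library(stringr)

# Hypothetical helper: convert one drive description such as "256GB SSD"
# or "1TB HDD" to gigabytes, using the unit found in the string
to_gb <- function(x) {
  size <- as.numeric(str_extract(x, "\\d+\\.?\\d*"))
  unit <- str_extract(x, "GB|TB")
  ifelse(unit == "TB", size * 1000, size)
}

# Made-up examples in the same format as the Memory column
memory <- c("128GB SSD", "1TB HDD", "256GB SSD +  1TB HDD", "1.0TB Hybrid")

# Split on "+" into at most two drives; a missing second drive becomes ""
parts <- str_split_fixed(memory, "\\+", 2)
first_gb  <- to_gb(parts[, 1])
second_gb <- to_gb(parts[, 2])
second_gb[is.na(second_gb)] <- 0  # no second drive

total_gb <- first_gb + second_gb
print(total_gb)
```

Reading the unit avoids relying on the fact that no laptop in the data has a 1 GB or 2 GB drive.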

laptop_cl <- laptop_cl %>%
  dplyr::left_join(memory_cl, by = c("laptop_id" = "laptop_ID")) %>%
  mutate(
    totalmemory = as.character(totalmemory)
  )

head(laptop_cl)
##   laptop_id company     product  typename inches screen_res           cpu ram
## 1         1   Apple MacBook Pro Ultrabook    <14  2560x1600 Intel Core i5   8
## 2         2   Apple Macbook Air Ultrabook    <14   1440x900 Intel Core i5   8
## 3         3      HP      250 G6  Notebook  14-16  1920x1080 Intel Core i5   8
## 4         4   Apple MacBook Pro Ultrabook  14-16  2880x1800 Intel Core i7  16
## 5         5   Apple MacBook Pro Ultrabook    <14  2560x1600 Intel Core i5   8
## 6         6    Acer    Aspire 3  Notebook  14-16   1366x768           AMD   4
##              gpu     op_sys weight price_euros totalmemory
## 1 Intel Graphics      macOS   1.37     1339.69         128
## 2 Intel Graphics      macOS   1.34      898.94         128
## 3 Intel Graphics      No OS   1.86      575.00         256
## 4            AMD      macOS   1.83     2537.45         512
## 5 Intel Graphics      macOS   1.37     1803.60         256
## 6            AMD Windows 10   2.10      400.00         500
# convert character variables to factors
laptop_fr <- laptop_cl %>%
  dplyr::select(-laptop_id) %>%
  mutate(across(where(is.character), factor))

head(laptop_fr)
##   company     product  typename inches screen_res           cpu ram
## 1   Apple MacBook Pro Ultrabook    <14  2560x1600 Intel Core i5   8
## 2   Apple Macbook Air Ultrabook    <14   1440x900 Intel Core i5   8
## 3      HP      250 G6  Notebook  14-16  1920x1080 Intel Core i5   8
## 4   Apple MacBook Pro Ultrabook  14-16  2880x1800 Intel Core i7  16
## 5   Apple MacBook Pro Ultrabook    <14  2560x1600 Intel Core i5   8
## 6    Acer    Aspire 3  Notebook  14-16   1366x768           AMD   4
##              gpu     op_sys weight price_euros totalmemory
## 1 Intel Graphics      macOS   1.37     1339.69         128
## 2 Intel Graphics      macOS   1.34      898.94         128
## 3 Intel Graphics      No OS   1.86      575.00         256
## 4            AMD      macOS   1.83     2537.45         512
## 5 Intel Graphics      macOS   1.37     1803.60         256
## 6            AMD Windows 10   2.10      400.00         500

Exploratory Analysis

laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(ram, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by Ram capacity") +
  theme_classic()

  • RAM: the boxplot of price by RAM shows the price distribution for each RAM capacity. Laptops with more RAM have a higher median price than laptops with less RAM.
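
The medians read from the boxplot can also be computed directly. The sketch below uses only the six rows printed by head(laptop_fr), so its numbers illustrate the method and say nothing about the full dataset; the column `ram_gb` is a name I introduce here.

```r
library(dplyr)

# The six rows printed by head(laptop_fr), not the full dataset
sample_rows <- tibble(
  ram = c("8", "8", "8", "16", "8", "4"),
  price_euros = c(1339.69, 898.94, 575.00, 2537.45, 1803.60, 400.00)
)

ram_summary <- sample_rows %>%
  mutate(ram_gb = as.numeric(ram)) %>%  # numeric copy so groups sort by capacity
  group_by(ram_gb) %>%
  summarise(n = n(), median_price = median(price_euros), .groups = "drop") %>%
  arrange(ram_gb)

print(ram_summary)
```

Applied to laptop_fr, the same pipeline gives one row per RAM capacity, which makes the trend easier to quote than a plot.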
laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(cpu, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by CPU type") +
  theme_classic()

laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(gpu, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by GPU type") +
  theme_classic()

  • CPU/GPU: the boxplots of price by CPU and by GPU show the price distribution for each processor family. Laptops with higher-end processors have a higher median price than laptops with lower-end ones.
laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(inches, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by Screen size (inches)") +
  theme_classic()

  • Screen size: the boxplot of price by screen size shows the price distribution for each size range. Laptops with larger screens have a higher median price than laptops with smaller screens.
laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(totalmemory, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by Memory capacity") +
  theme_classic()

  • Memory: the boxplot of price by total storage shows the price distribution for each storage capacity. Laptops with more storage do not always have a higher median price, because the total combines two technologies (HDD and SSD), and laptops with SSD storage are probably more expensive.
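
One way to explore the SSD versus HDD explanation would be to label each laptop by storage technology before comparing prices. The sketch below does this for the six rows printed by head(laptop_raw). The `storage_type` grouping is my assumption (in particular, "Flash Storage" is treated as solid-state) and is not part of the original analysis.

```r
library(dplyr)
library(stringr)

# Memory and price for the six rows printed by head(laptop_raw)
sample_rows <- tibble(
  Memory = c("128GB SSD", "128GB Flash Storage", "256GB SSD",
             "512GB SSD", "256GB SSD", "500GB HDD"),
  Price_euros = c(1339.69, 898.94, 575.00, 2537.45, 1803.60, 400.00)
)

storage <- sample_rows %>%
  mutate(
    # Hypothetical grouping; the order matters because case_when uses the first match
    storage_type = case_when(
      str_detect(Memory, "SSD") & str_detect(Memory, "HDD") ~ "SSD + HDD",
      str_detect(Memory, "SSD|Flash") ~ "SSD/Flash",
      str_detect(Memory, "HDD") ~ "HDD",
      TRUE ~ "other"
    )
  )

print(count(storage, storage_type))
```

With this label on the full data, a boxplot of price by storage_type would show whether the technology, rather than the capacity, tracks price.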
laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(op_sys, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by Operating system") +
  theme_classic()

laptop_fr %>%
  ggplot(aes(price_euros, fct_reorder(company, price_euros))) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Price (€) by Company") +
  theme_classic()

Based on the boxplots of price by component, we can hypothesize that better components are associated with higher prices. We can test this hypothesis with a one-way ANOVA. For RAM, the p-value (< 2e-16) leads us to reject the null hypothesis that the mean price is the same across all RAM capacities.

res.aov <- aov(price_euros ~ ram, data = laptop_fr)
summary(res.aov)
##               Df    Sum Sq  Mean Sq F value Pr(>F)    
## ram            8 375624250 46953031   233.2 <2e-16 ***
## Residuals   1294 260550712   201353                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
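
As a quick check of the table above, the F value can be recomputed from the sums of squares and degrees of freedom. The sketch also derives eta squared, an effect size that is not reported in the original output; I add it here only as an illustration of how much of the price variation is associated with the RAM groups.

```r
# Values copied from the summary(res.aov) table
ss_ram <- 375624250
df_ram <- 8
ss_res <- 260550712
df_res <- 1294

# F = mean square between groups / mean square within groups
f_value <- (ss_ram / df_ram) / (ss_res / df_res)

# Eta squared = share of the total sum of squares attributed to ram
eta_sq <- ss_ram / (ss_ram + ss_res)

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_ram, df_res, lower.tail = FALSE)

print(round(f_value, 1))
print(round(eta_sq, 2))
```

The recomputed F matches the 233.2 in the table, and eta squared comes out at about 0.59.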
laptop_fr %>%
  ggplot(aes(price_euros)) +
  geom_histogram(bins = 13) +
  labs(y = "", x = "", title = "Laptops' price (€) distribution") +
  theme_minimal()

laptop_fr %>% ggplot(aes(y = price_euros)) +
  geom_boxplot() +
  labs(y = "", x = "", title = "Laptops' price (€) distribution") +
  theme_minimal()

Our target variable has a right-skewed distribution: the data are not symmetric around the mean, but have a long tail to the right, with most values concentrated on the left. One implication is that the mean is higher than the median, and the standard deviation, inflated by the tail, may not describe the typical spread well. In addition, a skewed target can hurt models such as linear regression, whose usual inference assumes normally distributed residuals; a skewed target often produces skewed residuals.
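
The mean-versus-median comparison and the effect of a log transform can be sketched in a few lines. The example uses only the six prices printed by head(), so it illustrates the calculation and not the full distribution. The `skewness` helper is a hypothetical moment-based implementation that I define here, and the log transform is a common remedy I am assuming, not a step from the original analysis.

```r
# Prices of the six rows printed by head(); the full dataset is not reproduced here
price <- c(1339.69, 898.94, 575.00, 2537.45, 1803.60, 400.00)

mean_price <- mean(price)
median_price <- median(price)

# Hypothetical helper: moment-based sample skewness
# (positive values indicate a longer right tail)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(price)
skew_log <- skewness(log(price))  # skewness after a log transform

print(c(mean = mean_price, median = median_price))
print(c(raw = skew_raw, log = skew_log))
```

On laptop_fr$price_euros the same calls would quantify the skew seen in the histogram and show whether modeling log(price_euros) is worth trying.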
