May 10, 2021
This project was the first one assigned for my Advance Data Science and Predictive Analytics class, so I was pretty excited and feeling very advanced. And what a mysterious concept. You are an investment firm looking to invest in a used car dealership. The goal is to understand the relationship between car makes, models and price, recommend the top five vehicle manufactures to invest in for your used car business, and see if there is a relationship between fuel type and car price.
The data comes from a dataset of "classified ads for cars" stripped from the internet in Europe available in the open data community on Kaggle. I can't emphasize enough that this is a European dataset, because a lot of the can manufactures were not familiar to me, and they seem to drive a lot more diesel powered cars. Also, prices are in EUR.
A bit about the dataset. There are over 3.5 million rows (cars) in the dataset, with columns ranging from maker, model, mileage, manufacture_year, door_count, fuel_type, price_eur. There is a lot of missing data, with 15% of cars missing a maker and 32% missing a model. Colour is almost 94% blank, so there isn't going to be any colour/price analysis, that's for sure. Is that's your jam, check out a post on instamotor about car colour versus price. There they spell colour without the u. I can do that too. Color. I am a duel-citizen after all, so they both look normal and accurate to me.
Anyway, back to the assignment. So, here we are, this investment firm looking start a used cars dealership. The year is 2017 (the data stops in January of 2016) and we are digging into the data with our preferred method, Hive. It's preferred because that's what's been assigned to us in class. This is what it looks like working with hive. If you are expecting super beautiful visualizations, you should go check out another blog entry.
Working with hive, you load data into a Hadoop cluster and create a Hive table. Want to see what that looks like?
Creating a table called usedcars in hive. Column names were given based on original dataset. Yes, door_count and seat_count columns could have been ingested as integers too, but we didn't end up examining these columns in our report.
Next we should count all the cars in the table:
hive> select count(*) from usedcars;
We have 3,552,912 cars!
And there we have it, a hive table called usedcars with our 3.5 million cars loaded and ready to analyse.
Next thing we wanted to do is get rid of all the cars (rows) we don't want to include in our table. Those are the rows without a maker or model listed, cars made before the year 2000 and cars that are just too cheap or just too expensive, let's say, between 3,000 EUR and 2,000,000 EUR.
How are we going to do that? Here is the hive query:
hive>from usedcars insert into table cleanusedcars select * where model != ' ' and maker != ' ' and price_eur >= 3000.00 and price_eur <= 2000000.00 and manufacture_year >= 2000 and manufacture_year <=2017;
And that beauty line leaves us with 1,322,853 cars in our new cleanusedcars hive table.
What do we have left, what kind of cars and what prices are we looking at?
Above you can see the makers with the highest average price in the cleanusedcar table. There are four units of information displayed per row. The first number is the average price, the second number is the distinct number of models by the maker in the dataset (this number shifts to the left after Bentley because there are fewer digits in the average price of the car), the third number is the number of cars by that maker found in the dataset, and finally we have the maker itself. Here is the hive command:
hive> select round(avg(price_eur)) as avgprice, count(distinct model), count(*), maker from cleanusedcars group by maker order by avgprice desc;
This gives us a lot of information. We know there are only 3 rolls-royce and 2 aston-martin and so are not likely to recommend those for our used car dealership. We also know there are 44 different manufacturers. Lots to choose from. Exciting stuff. But how are we going to narrow down what manufacturer to recommend to start the dealership. Like, what cars are good cars to sell used? That's the question. Let's find out!
Okay, I kind of lied, before we share our really riveting investigation into which cars to recommend selling at our used car dealership, let's take a look at the fuel type and it's relationship with the price of the car. As a non-car enthused North American, I had no idea there was really a relationship at all, but in this European dataset, it's a pretty dramatic relationship. Let's have a look and see what that is.
First of all, let's count the number of cars that are diesel, gasoline and cars that don't have a fuel_type listed.
hive> select fuel_type, avg(price) as avg_price from cleanusedcars group by fuel_type;
This shows us that the majority of cars in the dataset do not have fuel type listed, and that of the cars that are labeled, there are only two types of fuel type, gasoline and diesel. It also shows that there are almost the exact same number of vehicles, 240,000, labeled as either gasoline or diesel.
So, is a diesel or gasoline car more expensive in our used car dataset?
hive> select fuel_type, avg(price) as avg_price from cleanusedcars group by fuel_type;
Turns out diesel cars are priced considerably more, almost 4000 EUR more than cars labeled with gasoline fuel type.
Alight, so that gives us a baseline. Cars labeled with diesel fuel type within the whole dataset are more expensive than cars labeled with gasoline. But what about within individual car manufacturers?
Let's have a look.
hive> select maker, fuel_type, avg(price) as avg_price from cleanusedcars group by fuel_type, maker order by maker;
Looks like this trend is consistent over almost all car manufacturers. Each line shows three pieces of data, the manufacturer, the fuel type (if blank, it was not listed), and the average price of the car. Audi (+2500), Chevrolet (+2700), Citreon (+4200), Fiat (+3200) all show diesel cars as more expensive than gasoline cars by the same manufacturer. But not all diesel cars are more expensive. What's going on? Is there a better way to show this data? Yes!
To show this much clearer, we decided to see what percentage difference the average price was between the diesel cars and gasoline cars for each manufacturer.
To prove mathematically diesel cars are more expensive than gasoline in this dataset, the follow code below will compare the average price of a gas versus diesel vehicle from the same maker. The result at the end shows the percent cost of a diesel car versus a gasoline car.
#create two new tables, one for just diesel cars and one for just gasoline cars
hive> create table dieselprice (name string, fuel string, avgprice float) stored as textfile;
hive> create table gasprice (name string, fuel string, avgprice float) stored as textfile;
#move only the cars labeled as diesel into the new diesel table and the cars labeled as gasoline into the new gasoline table, along with the manufacturers name and the avg price
hive> from clearnusedcars insert into table dieselprice select maker, fuel_type, avgprice where fuel_type like "diesel";
hive> from clearnusedcars insert into table gasprice select maker, fuel_type, avgprice where fuel_type like "gasoline";
#now join the tables, dividing the diesel cars by the gasoline cars average price
hive> SELECT dieselprice.name, ROUND((dieselprice.avgprice / gasprice.avgprice), 5) AS result FROM gasprice INNER JOIN dieselprice ON gasprice.name = dieselprice.name order by result desc;
What this shows us is that of the 36 manufacturers that had both a diesel and gasoline car listed in the dataset, only six of them (bottom six, starting with BMW) had the price of their gasoline car more expensive than diesel. For the rest, the manufacturer's diesel cars were prices more, sometime A LOT more, then the same manufacturer's gasoline cars.
There we have it, diesel cars are on average priced higher, sometimes significantly higher, than gasoline cars by the same manufacturer.
Alright, now, do you still remember what the assignment actually was? Just a reminder, our team was tasked with figuring out what car manufacturers to recommend to start our used car business. How were we going to do that and what criteria were we going to use to determine these manufacturers? Let's find out.
If we were to run a used car dealership, we wouldn't want to sell the junky cars. We also wouldn't want to sell the super expensive cars. We were thinking we would just want to sell the reliable cars. Cars that don't go down that much in value, cars that people can sell again in the future. That being the case, we wanted to sell cars that depreciated the least in value compared to other manufacturers.
Depreciation is a thing of course. For the whole cleanusedcar table we had created, car prices on average depreciated from 21,494 EUR for cars manufactured in 2015 to 7436 EUR for cars manufactured in 2000.
hive> select manufacture_year, count(*) as cars, avg(price_eur) from cleanusedcars group by manufacture_year;
Depreciation of cost of vehicle by manufacture year across all manufacturer types. Each line shows three fields, manufacturer year, count of number of vehicles with that manufacture year, average price of vehicles with that manufacture year.
What would be cool is if we could figure out what % drop in depreciation there is from our baseline year, 2016. To do this, we use the code below, and with it, we can see the percentage drop in the average price year over year compared to the base manufacture_year average price for 2016:
hive> select manufacture_year, round(avgprice,2), round(p2016,2), round((avgprice/p2016 - 1)*100,2) from (select manufacture_year, avg(price_eur) as avgprice, 1 as keya from cleanusedcars group by manufacture_year) as a LEFT JOIN (select avg(price_eur) as p2016, 1 as keyb from cleanusedcars where manufacture_year = 2016) as b on a.keya = b.keyb order by manufacture_year DESC;
This hive query gives us a the percentage drop in price, year over year, from the baseline price of cars manufactured in 2016. For example, cars manufactured in 2011 have on average depreciated 41.5%. Cars manufactured in 2004 have on average depreciated 73.46%, almost that same as cars manufactured in 2003 and 2002. Oh, and look, cars in 2000 and 2001 are on average slightly more expensive than cars manufactured in those three years (2002-2004). Go figure.
So our team made a few business decisions when it came to deciding what cars to sell at our dealership. We wanted to keep the cars relatively new, only selling cars manufactured within the past 5 years. That would place the oldest cars on our lot, upon opening, manufactured in 2011.
We also wanted to sell the cars with the least amount of depreciation over that time period. This measure is important for multiple reasons: 1) it shows the quality of the car 2) demand for these cars 3) higher resale value is a selling point for a used car. That all makes sense right? Plus, let's say the sales people at our dealership were working on commission, wouldn't it be nicer to have them sell higher priced cars?
So what car manufacturers at the least depreciation from 2016 to 2011? Let's find out!
For the analysis, we need to create two tables:
One containing the average prices of cars manufactured in 2011 (per manufacturer):
hive> CREATE TABLE avg2011 AS SELECT avg(price_eur) AS price, maker, count(*) AS total FROM cleanusedcars WHERE manufacture_year == 2011 GROUP BY maker;
The other table contains the same information but this time for cars manufactured in 2016:
hive> CREATE TABLE avg2016 AS SELECT avg(price_eur) AS price, maker, count(*) AS total FROM cleanusedcars WHERE manufacture_year == 2016 GROUP BY maker;
With the use of these two tables, we can now create a table of final results containing the information about the speed of depreciation of cars between the years 2016 and 2011 per each manufacturer. The table below shows the top 5 manufacturers, the percentage of the price from 2016 kept in 2011, the number of cars per manufacturer on the market for the years 2016 and 2011:
hive>SELECT avg2011.maker, ROUND((avg2011.price/avg2016.price)*100, 4) AS result, avg2016.total, avg2011.total FROM avg2016 INNER JOIN avg2011 ON avg2016.maker = avg2011.maker ORDER BY result DESC ;
This output gives four items per line, the name of the manufacturer, the average 2011 price as a percentage of the average 2016 price for that manufacturer, count of the number of cars for sale by that manufacturer made in 2016, count of number of cars for sale by that manufacturer made in 2011.
We exclude Maserati from our recommendation since there were not enough cars from this manufacturer on the market, only 59 in 2016 and 93 in 2011.
Therefore, based on these criteria, we recommend the following manufacturers for our used car dealership:
Lancia
Jeep
Nissan
Hyundai
Citroen
You'll notice a couple of the manufacturers are not too familiar to North American audiences, including myself, but the data has spoken, and I stand by the recommendation!
Although hive doesn't offer visuals, the fundamentals are sometimes nice. Just give me the data, no need to make it look pretty.
As it has been scraped from the internet, this dataset is not entirely trustworthy. Although we did clean the dataset a bit, I think a finer tooth comb would need to be used to get a more meaningful purposes. The best thing going for it is the size of the dataset, meaning that even with outliers and incorrect data, as a whole, it's somewhat illustrative.
This assignment was a part of York University School of Continuing Studies Certificate in Advance Data Science and Predictive Analytics class. Talented classmates included Petr, Chirag and Carole.