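The data are scraped every few months and contain most of the relevant information provided about car sales, including columns such as price, condition, manufacturer, latitude/longitude, and 18 other categories. For machine learning (ML) projects, consider feature engineering on the location columns (e.g., long/lat).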
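# We can get the number of observations by counting the rows
## [1] 34677
# Alternatively, we can get the dimensions of the data set and look at the number of rows (observations).
## [1] 34677 27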
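The calls that produced the output above are not shown in the text. A minimal sketch of the two approaches, using a small made-up data frame (`toy` and its columns are illustrative, not the real data), might look like this:

```r
# Toy stand-in for the real data; names and values are illustrative only
toy <- data.frame(price = c(4500, 12000, 1),
                  year  = c(2004, 2012, 1999))

n_obs <- nrow(toy)  # number of rows, i.e. observations
dims  <- dim(toy)   # c(number of rows, number of columns)

print(n_obs)
print(dims)
```

On the real data, the first element of `dim()` should agree with `nrow()`.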
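There are some very expensive cars, such as the Mercedes-Benz G63 AMG, Bentley Mulsanne, Maserati 3500 GT, Porsche GT, etc.
There are also many cars listed at a price of 1. Posting at the lowest possible price is a common advertising tactic: most people sort listings from lowest to highest price, so these ads show up at the top more often. Most of them are misleading ads from dealers, some are for car parts, and some are car-financing offers. There are too many to clean by hand, so we exclude them.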
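One way to exclude the placeholder-price ads is a simple row filter. This is only a sketch on made-up data; the column names `price` and `title` and the cutoff of 1 are assumptions for illustration:

```r
# Made-up listings; the first two mimic placeholder-price ads
toy <- data.frame(
  price = c(1, 1, 3500, 8200, 15000),
  title = c("CALL NOW financing", "bumper", "2004 Honda Civic",
            "2010 Ford F-150", "2013 Toyota Camry")
)

# Count the placeholder-price ads before dropping them
n_placeholder <- sum(toy$price <= 1, na.rm = TRUE)

# Keep only rows with a known price above the placeholder value
cleaned <- toy[!is.na(toy$price) & toy$price > 1, ]

print(n_placeholder)
print(cleaned)
```

Counting the dropped rows first makes it easy to report how much data the filter removes.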
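It is somewhat suspicious that every city has about 5,000 observations and that the split within each city is almost exactly 50/50.
We can see very clearly from the table and the plot that the share of owner posts versus dealer posts is almost exactly 50/50, and that it does not seem to vary between cities at all.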
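A per-city proportion table of this kind can be built with `table()` and `prop.table()`. The sketch below uses made-up data, and the column names `city` and `byOwner` are assumptions:

```r
# Made-up data: two cities, each split evenly between owner and dealer posts
toy <- data.frame(
  city    = rep(c("boston", "denver"), each = 4),
  byOwner = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

counts <- table(toy$city, toy$byOwner)   # rows: city, columns: FALSE/TRUE
props  <- prop.table(counts, margin = 1) # normalise within each city (row)

print(round(props, 2))
```

With `margin = 1` each row sums to 1, so the owner/dealer split can be compared directly across cities.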
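As we saw in Question 3 above, the price data have many problems.
Within each city, two of the top three makes sold by owner are also among the top three makes sold by dealer in that same city.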
The 2022 Honda Odyssey has "only 117102 miles", so the year is probably a typo for 2002.
The Jeep with year = 4 is probably a 2004 model, since it has an "AM/FM cassette player-muli CD player".
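Corrections like these can be applied by targeting the specific rows. The sketch below works on made-up rows; the column names `header` and `year` and the matching conditions are assumptions for illustration:

```r
# Made-up rows mimicking the two suspicious years discussed above
toy <- data.frame(
  header = c("2022 Honda Odyssey", "4 Jeep Cherokee", "2011 Subaru Outback"),
  year   = c(2022, 4, 2011)
)

# Match on both the year and the vehicle so no other rows are touched
is_odyssey <- toy$year == 2022 & grepl("Odyssey", toy$header)
is_jeep    <- toy$year == 4 & grepl("Jeep", toy$header)

toy$year[is_odyssey] <- 2002  # assumed typo: 2022 should read 2002
toy$year[is_jeep]    <- 2004  # assumed truncation: 4 should read 2004

print(toy)
```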
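We can improve this plot further by using the alpha parameter to control the transparency of the plotted points, which gives a better view of the density and of how points bleed into other regions.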
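The original plotting call is not shown, so here is only a sketch of the idea with base graphics and simulated points. `adjustcolor()` adds an alpha channel to a colour, and the variables `x` and `y` are made up:

```r
set.seed(1)
# Simulated, heavily overlapping points
x <- rnorm(2000)
y <- x + rnorm(2000)

# Black at 10% opacity, so dense regions appear darker
pt_col <- adjustcolor("black", alpha.f = 0.1)

pdf(NULL)  # draw to a null device so no file is written; omit when plotting interactively
plot(x, y, pch = 16, col = pt_col, xlab = "x", ylab = "y")
invisible(dev.off())
```

Lower `alpha.f` values suit larger data sets, since more points have to overlap before a region looks solid.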
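Note that in the dot plot below, the distributions in the different panels are almost the same, but there is some variation in the middle column, where fuel type = "gas". We can therefore essentially drop fuel type from the plot and subset to only fuel type = "gas".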
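We should spend some time cleaning the odometer readings. For example, the maximum odometer reading, 1234567890, is just some advertisement. For simplicity, however, we note that the 99th percentile of the odometer readings is 2.6 × 10^5, so trimming the data at 500,000 keeps almost the entire distribution.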
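The percentile check and the trimming step can be sketched as follows. The readings are simulated, and only the cutoff of 500,000 and the junk value come from the text:

```r
# Simulated readings: 999 plausible values plus one junk entry from an ad
odometer <- c(seq(0, 300000, length.out = 999), 1234567890)

# 99th percentile; barely affected by the single extreme value
p99 <- unname(quantile(odometer, 0.99, na.rm = TRUE))

# Trim at 500,000 as described in the text
trimmed <- odometer[!is.na(odometer) & odometer <= 500000]

print(p99)
print(length(trimmed))
```

Because the percentile sits well below the cutoff, the trim removes only the extreme tail.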
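The vast majority of the data do appear to follow an upward trend. Note, however, the shading between roughly 5 and 20 years of age where the odometer readings are low.
Based on the first smoothScatter plot below, we treat cars more than 35 years old as "old" cars.
From the table below, Chevrolet and Ford account for more than 50% of the "old" cars. In particular, there are not many Japanese "old" cars, because large-scale imports of Japanese cars did not begin until the oil crisis of the 1970s, and our cutoff for "old" cars is around 1970.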
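A manufacturer breakdown for the "old" cars can be computed by subsetting on age and tabulating. This is a sketch on made-up data; the column names `maker` and `age` are assumptions, and only the 35-year threshold comes from the text:

```r
# Made-up cars with a manufacturer and an age in years
toy <- data.frame(
  maker = c("chevrolet", "ford", "chevrolet", "toyota",
            "ford", "dodge", "honda", "chevrolet"),
  age   = c(45, 50, 40, 10, 38, 60, 5, 55)
)

old <- toy[toy$age > 35, ]  # threshold taken from the text

# Share of each manufacturer among the old cars, largest first
share <- sort(prop.table(table(old$maker)), decreasing = TRUE)

print(round(share, 2))
```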
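When searching for a car on a website, one usually gives the year, make, and model, in that order. Year and make (i.e., manufacturer) are already separate variables in the data set. Note, however, that the variables referred to here are year, make, and model. So if we can parse the text string of each title to extract the model, we can derive our own separate variable for the model.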
head(vposts$header, 20)
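One possible way to pull the model out of a title, assuming it follows a "year make model ..." layout, is a regular expression built from the known make. The headers and makers below are made up, and real titles will need more than one pattern:

```r
# Made-up titles and the corresponding known makes
headers <- c("2004 Honda Civic EX", "2011 toyota camry le", "1998 Ford F-150 XLT")
makers  <- c("honda", "toyota", "ford")

# Pattern: 4-digit year, the make, then capture the next token as the model
patterns <- sprintf(
  "^[[:space:]]*(19|20)[0-9]{2}[[:space:]]+%s[[:space:]]+([[:alnum:]-]+).*$",
  makers
)

# sub() is not vectorised over patterns, so pair each pattern with its header
model <- mapply(function(p, h) {
  if (grepl(p, h, ignore.case = TRUE)) {
    sub(p, "\\2", h, ignore.case = TRUE)
  } else {
    NA_character_  # layout not recognised; leave for another strategy
  }
}, patterns, headers, USE.NAMES = FALSE)

model <- tolower(model)
print(model)
```

Titles that do not match come back as `NA`, which makes it easy to see how many posts still need a different approach.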
conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
# We will base our new categories on the most common existing categories.
sort(table(vposts$condition))
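# Now we can see that the highest odometer readings appear to be in the "fair" and "good" conditions, which is somewhat surprising. It is possible that people overstate the condition when the odometer reading is high, to make the car sound more appealing. The condition with the widest spread is "salvage", which makes sense because salvage cars can be very old or can be newer cars that have been damaged.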
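Collapsing the condition labels and then comparing odometer readings across them can be sketched as below. The lookup table `recode` and the data are hypothetical; the actual mapping depends on which labels occur in the data:

```r
# Made-up posts with messy condition labels
toy <- data.frame(
  condition = c("good", "Good", "fair", "excellent",
                "salvage", "like new", "very good", "salvage"),
  odometer  = c(120000, 140000, 180000, 60000,
                20000, 15000, 90000, 250000)
)

# Hypothetical lookup table: messy label -> cleaned category
recode <- c("good" = "good", "very good" = "good", "fair" = "fair",
            "excellent" = "excellent", "like new" = "like new",
            "salvage" = "salvage")
toy$cond_clean <- unname(recode[tolower(toy$condition)])

# Median and range of odometer readings within each cleaned category
med    <- tapply(toy$odometer, toy$cond_clean, median)
spread <- tapply(toy$odometer, toy$cond_clean, function(x) diff(range(x)))

print(med)
print(spread)
```

Labels missing from the lookup table become `NA`, so checking for missing values after recoding shows which labels still need a mapping.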
Question #1 How many observations are there in the data set?
Question #2 What are the names of the variables, and what is the class of each variable?
Question #3 What is the average price of all the vehicles? The median price? The deciles? Display these on a plot of the distribution of vehicle prices.
Question #4 What are the different categories of vehicles, i.e. the type variable/column? What is the proportion for each category?
Question #5 Display the relationship between fuel type and vehicle type. Does this depend on transmission type?
Question #6 How many different cities are represented in the dataset?
Question #7 Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” varies across cities.
Question #8 What is the largest price for a vehicle in this data set? Examine this and fix the value. Now examine the new highest value for price.
Question #9 What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?
Question #10 Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?
Question #11 Plot the locations of the posts on a map. What do you notice?
Question #12 Summarize the distribution of fuel type, drive, transmission, and vehicle type. Find a good way to display this information.
Question #13 Plot odometer reading against age of car. Is there a relationship? Similarly, plot odometer reading against price. Interpret the result(s). Are odometer reading and age of car related?
Question #14 Identify the “old” cars. What manufacturers made these? What is the price distribution for these?
Question #15 I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables? If so, sketch possible ideas as to how we would compute this variable.
Question #16 Display how condition and odometer are related. Also how condition and price are related. And condition and age of the car. Provide a brief interpretation of what you find.
posts by people selling vehicles. The important variable that I did not give you was the model/type of the vehicle being sold. This is very important for determining the price of the vehicle. For example, a new Volvo V60 has a suggested price of 35,000, but a new S60 has a price of 43,000, and the new Toyota Yaris and Avalon are 15,000 and 32,000 respectively, a factor of 2. So we need to determine the model of the vehicle.
We also want to verify some of the data and fix it if possible. And we also want to be able to programmatically extract other information from the posts if it is present.
When doing these questions, you will very likely have to iterate by developing a regular expression, seeing what results it gives you, and adapting it. Furthermore, you will probably have to use two or more strategies when looking for a particular piece of information. This is expected; the data are not nice and regularly formatted.
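As an illustration of combining several strategies, the sketch below tries three patterns for a mileage figure and takes the first one that matches. The post texts, the patterns, and the helper `extract_first()` are all made up for this example:

```r
posts <- c("Runs great, 117,102 miles, clean title",
           "only 85k miles!",
           "Mileage: 64000",
           "no info here")

# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Strategy 1: digits (with optional commas) directly followed by "miles"
s1 <- as.numeric(gsub(",", "", extract_first("[0-9][0-9,]*(?=\\s*miles)", posts)))
# Strategy 2: a number followed by "k miles", scaled by 1000
s2 <- 1000 * as.numeric(extract_first("[0-9]+(?=k\\s*miles)", posts))
# Strategy 3: a number directly after "mileage: "
s3 <- as.numeric(gsub(",", "", extract_first("(?<=mileage:\\s)[0-9,]+", posts)))

# Take the first strategy that produced a value
mileage <- ifelse(!is.na(s1), s1, ifelse(!is.na(s2), s2, s3))
print(mileage)
```

Inspecting the posts that still come back as `NA` is usually what suggests the next pattern to add.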
Pick two models of cars, each for a different car maker, e.g., Toyota or Volvo. For each of these, separately explore the relationship between the price being asked for the vehicle, the number of miles (odometer), age of the car and condition. Does location (city) have an effect on this? Use a statistical model to be able to suggest the appropriate price for such a car given its age, mileage, and condition. You might consider a linear model, k-nearest neighbors, or a regression tree.
You need to describe why the method you chose is appropriate, what assumptions are needed and how reasonable they are, and how well it performs and how you determined this. Would you use it if you were buying or selling this type of car?
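As one possible starting point, here is a sketch of the linear-model option with a simple hold-out check. The data are simulated, so the coefficients mean nothing for real cars; the column names and the train/test split are assumptions:

```r
set.seed(42)
n <- 200

# Simulated cars of a single model; price falls with age and mileage
cars <- data.frame(
  age       = runif(n, 0, 15),
  odometer  = runif(n, 5000, 200000),
  condition = factor(sample(c("fair", "good", "excellent"), n, replace = TRUE))
)
cars$price <- 25000 - 900 * cars$age - 0.05 * cars$odometer +
  ifelse(cars$condition == "excellent", 1500, 0) + rnorm(n, sd = 1000)

# Hold out 50 cars to estimate out-of-sample error
train_idx <- sample(n, 150)
fit  <- lm(price ~ age + odometer + condition, data = cars[train_idx, ])
pred <- predict(fit, newdata = cars[-train_idx, ])
rmse <- sqrt(mean((cars$price[-train_idx] - pred)^2))

# Suggested price for a hypothetical car
new_car <- data.frame(age = 5, odometer = 60000,
                      condition = factor("good", levels = levels(cars$condition)))
suggested <- predict(fit, newdata = new_car)

print(rmse)
print(suggested)
```

The hold-out RMSE gives a rough idea of how far a suggested price may be off. The same split can be reused to compare against k-nearest neighbors or a regression tree.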
Useful Functions
strsplit(), grep(), grepl(), gregexpr(), sub(), gsub(), agrep(), adist(), nchar(), substring(). The stringi and stringr packages.
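As a small illustration of the approximate-matching functions in this list, `adist()` can map misspelled maker names to the closest known name. The vectors `known` and `typed` are made up for the example:

```r
known <- c("toyota", "volvo", "honda", "chevrolet")
typed <- c("Volve", "toyta", "chevorlet", "honda")

# Edit distance between every typed name (rows) and every known name (columns)
d <- adist(tolower(typed), known)

# For each typed name, pick the known name with the smallest distance
best <- known[apply(d, 1, which.min)]

print(data.frame(typed, best))
```

In practice a maximum allowed distance is worth adding, so that unrelated strings are not forced onto the nearest maker.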