“I want to take you to Shangri-La,
to turn the prayer flags together.”
The good guy is me, and I am He Kunyu. Hahaha
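DataCamp mainly offers courses on Python and R. The content is split into small pieces, and teaching combines written explanations with in-browser coding, interspersed with short lecture videos. The format is level-based, like clearing stages in a game, which makes it fun. However, the explanations are relatively shallow, there is little freedom in how you solve a problem, and most tasks are not challenging. It suits beginners; anyone with some R background is better off working through a book.
I have recently been working through R courses on DataCamp and decided on a whim to share some notes. Notes have to be taken anyway, and this also helps me keep up a posting schedule. Notes for other courses will follow, so stay tuned.
Because the notes were taken while following the video lectures, they are fairly rough, so I have attached a simple case from the DataCamp course. You can click "Read the original" to view the markdown report.
If this post and my report are helpful, I would appreciate it if you could star my repo (top-right corner) when you click through to view the report.
The notes follow: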
Cleaning Data in R
Exploring Raw Data
Check data structure:
class(),
dim(),
names(),
str(), or glimpse() from `dplyr`[1],
summary().
Looking at the data
head(),
tail().
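The inspection functions listed above can be tried on any data frame. The following is a minimal sketch in base R; the `scores` data frame and its values are made up purely for illustration and do not come from the course.

```r
# Hypothetical toy data frame; the values are invented for illustration.
scores <- data.frame(
  name  = c("Ann", "Bob", "Cara", "Dev"),
  score = c(90, 85, NA, 72),
  stringsAsFactors = FALSE
)

# Check the data structure
class(scores)    # "data.frame"
dim(scores)      # 4 rows, 2 columns
names(scores)    # "name" "score"
str(scores)      # compact view of column types and first values
summary(scores)  # per-column summary; shows the NA count for score

# Look at the data itself
first_rows <- head(scores, 2)  # first two rows
last_rows  <- tail(scores, 2)  # last two rows
```

`glimpse()` from `dplyr` gives a view similar to `str()`; it is left out here so the sketch runs without extra packages.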
Tidying Data
Tidy data: observations as rows and variables as columns, with only one type of observational unit per table
Addressing messy data with `tidyr`[2]
Q: Column headers are values, not variable names (wide data)
A: gather(data = ,
key = new key column,
value = new value column, ...,
-c(col) = columns to ignore)
Q: Variables are stored in both rows and columns (common in long data)
A: spread(data = ,
key = bare name of the key column,
value = bare name of the value column)
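As a rough illustration of the two answers above, the sketch below reshapes a small table from wide to long with `gather()` and back again with `spread()`. The `wide` table and its column names are hypothetical. Both functions are still available in current `tidyr` releases, although newer versions recommend `pivot_longer()` and `pivot_wider()` instead.

```r
library(tidyr)

# Hypothetical wide table: the year columns are values, not variable names.
wide <- data.frame(
  country = c("A", "B"),
  `2017`  = c(10, 20),
  `2018`  = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather every column except country into key/value pairs.
long <- gather(wide, key = "year", value = "cases", -country)

# Spread the key column back out into one column per year.
wide_again <- spread(long, key = year, value = cases)
```

After `gather()`, each row of `long` holds one country/year observation, which is the tidy layout described at the top of this section.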
Q: Multiple variables are stored in one column
A: separate(data = ,
col = column to separate,
into = vector of new column names,
sep = "")
Q: Multiple columns need to be combined into one
A: unite(data = ,
col = name of new column, ...,
sep = "")
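The last two answers are inverses of each other, which a short sketch can show. The `sales` data frame below is a made-up example, not course material.

```r
library(tidyr)

# Hypothetical data: year and month are stored together in one column.
sales <- data.frame(
  year_month = c("2018-06", "2018-07"),
  amount     = c(100, 120),
  stringsAsFactors = FALSE
)

# Split year_month into two character columns at the hyphen.
split_up <- separate(sales, col = year_month,
                     into = c("year", "month"), sep = "-")

# Paste the two columns back together into a single column.
rejoined <- unite(split_up, col = "year_month", year, month, sep = "-")
```

Note that `separate()` returns character columns by default, so `month` keeps its leading zero.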
Preparing Data for Analysis
Type conversions:
as.[class](),
as.Date(), as.POSIXct(), or `lubridate`[3].
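A few of these conversions, sketched in base R only. The input strings are arbitrary examples, and `lubridate` is left out so the snippet has no package dependencies.

```r
# Values often arrive as character strings and need converting.
num <- as.numeric("3.14")    # character -> numeric
int <- as.integer("42")      # character -> integer
chr <- as.character(2018)    # numeric -> character
lgl <- as.logical("TRUE")    # character -> logical

# Dates need a format string when they are not written as year-month-day.
d <- as.Date("07/26/2018", format = "%m/%d/%Y")
format(d, "%Y-%m-%d")        # "2018-07-26"

# Date-times; the time zone is set explicitly to keep the result reproducible.
dt <- as.POSIXct("2018-07-26 14:30:00", tz = "UTC")
format(dt, "%H:%M")          # "14:30"
```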
String manipulation with `stringr`[4]
str_trim(): trim white spaces;
str_pad(): pad with additional characters;
str_detect(): detect a pattern;
str_replace(): detect and replace a pattern;
tolower("ABC") = "abc";
toupper("abc") = "ABC".
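The string helpers above, applied to a few arbitrary inputs chosen for illustration. `tolower()` and `toupper()` are base R functions rather than part of `stringr`.

```r
library(stringr)

trimmed  <- str_trim("  hello  ")                              # "hello"
padded   <- str_pad("7", width = 3, side = "left", pad = "0")  # "007"
detected <- str_detect(c("apple", "kiwi"), "pp")               # TRUE FALSE
replaced <- str_replace("2018/07/26", "/", "-")                # "2018-07/26"
lower    <- tolower("ABC")                                     # "abc"
upper    <- toupper("abc")                                     # "ABC"
```

`str_replace()` changes only the first match, which is why the second slash survives; `str_replace_all()` replaces every match.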
Missing data formats:
NA;
Inf: 1/0;
NaN: 0/0;
#N/A (from Excel);
. (from SPSS, SAS);
"": Empty string or blank.
Finding and dealing with missing values
any(is.na()): whether there are missing values;
sum(is.na()): how many missing values;
summary(): where and how many;
na.omit(): remove all rows with missing values.
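A small sketch of these checks in base R. The data frame `df` and the vector `z` are invented for illustration. Recoding `""` and `"#N/A"` to `NA` by hand is one possible approach, not necessarily the one used in the course.

```r
df <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "b", NA),
  stringsAsFactors = FALSE
)

has_missing <- any(is.na(df))  # TRUE: at least one value is missing
n_missing   <- sum(is.na(df))  # 2: total number of missing values
summary(df)                    # shows the NA count for the numeric column x
complete    <- na.omit(df)     # keeps only rows without any NA

# Special values
1 / 0         # Inf
0 / 0         # NaN
is.na(0 / 0)  # TRUE: is.na() treats NaN as missing
is.na(1 / 0)  # FALSE: Inf is not missing

# Other encodings of "missing" have to be recoded to NA explicitly.
z <- c("a", "", "#N/A")
z[z %in% c("", "#N/A")] <- NA
```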
Finding outliers and obvious errors
summary()
boxplot(data)
hist()
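To illustrate, here is a sketch on a made-up vector with one obvious data-entry error. `plot = FALSE` is used only so that the snippet returns the underlying numbers without drawing; in interactive use you would normally call `boxplot()` and `hist()` without it and inspect the plots.

```r
# Hypothetical ages; 250 is an obvious data-entry error.
ages <- c(23, 25, 31, 28, 35, 29, 250)

summary(ages)  # the maximum stands out against the quartiles

# boxplot() flags points beyond 1.5 * IQR from the hinges as outliers.
stats <- boxplot(ages, plot = FALSE)
stats$out      # 250

# hist() bins the values; an isolated far-right bin hints at an error.
h <- hist(ages, plot = FALSE)
h$counts
```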
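The Cleaning Data in R course is only an introduction to data cleaning and covers a limited set of methods and functions; the Intermediate R course is a useful supplement.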
07/26/2018
He Kunyu
Code: github.com/QuinninR/QuinninR-sample-analysis
Report: rpubs.com/QuinninR/407585
[1] Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.5.
[2] Hadley Wickham and Lionel Henry (2018). tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions. R package version 0.8.1.
[3] Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25.
[4] Hadley Wickham (2018). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.3.1.