前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Stata 数据处理 | 一份常用并实用的命令清单

Stata 数据处理 | 一份常用并实用的命令清单

原创
作者头像
直立行走
修改2024-08-04 13:39:02
2610
修改2024-08-04 13:39:02
举报
文章被收录于专栏:Policy 是门科学

数据处理是一个不断积累反复的训练过程,本文所涉及的命令既有 Stata 系统自带的,也有外部用户编写的,不求面面俱到,但求有用有效。热烈欢迎对此文有兴趣的学友批评指正文内错误、推荐好用命令,版主会第一时间修改补充!

最新版本:20240804

命令数量:28

初始版本:20240731


1. browse if:精准浏览数据集

代码语言:js
复制
sysuse auto, clear
browse if substr(make, 1, 1) == "A" //只显示 make 中首字母为 A 的观测值
browse //浏览窗口恢复至整个数据

2. fre :同时显示类别变量的标签及其取值

代码语言:js
复制
ssc install fre    
sysuse auto, clear
tab foreign 
/*
 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00 */

fre foreign

/*foreign -- Car origin
----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
-------------------+--------------------------------------------
Valid   0 Domestic |         52      70.27      70.27      70.27
        1 Foreign  |         22      29.73      29.73     100.00
        Total      |         74     100.00     100.00            
---------------------------------------------------------------- */

3. mdesc:查看变量缺失值及其占比

代码语言:js
复制
ssc install mdesc   
sysuse lifeexp, clear
help mdesc
mdesc //所有变量
/*
    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
         region |           0             68           0.00
        country |           0             68           0.00
      popgrowth |           0             68           0.00
           lexp |           0             68           0.00
          gnppc |           5             68           7.35
      safewater |          28             68          41.18
----------------+----------------------------------------------- */

mdssc safewater //指定变量

4. ereplace:一步实现变量数值替换

代码语言:js
复制
ssc install ereplace
sysuse auto, clear
ereplace mpg = max(mpg) //相比于 egen 而言便捷许多

5. distinct:查看变量非重复取值数量

代码语言:js
复制
sysuse lifeexp, clear
distinct
distinct, missing
distinct safewater 
/*
           |        Observations
           |      total   distinct
-----------+----------------------
 safewater |         40         29 */

distinct safewater, missing  //缺失值被算作一类特殊“数值”
/*
           |        Observations
           |      total   distinct
-----------+----------------------
 safewater |         68         30 */

6. bysort var1(var2): keep if _n= 1:保留组内最大值

代码语言:js
复制
sysuse census, clear

*保留各区域内人口数最小的观测值
bysort region(pop): gen num = _n //region(pop) 表示 pop 在 region 内排序
keep if num ==1
drop num

*一步实现上述命令
bysort region (pop): keep if _n==1 

7. isid:变量观测值是否唯一

代码语言:js
复制
sysuse auto, clear
isid foreign //variable foreign does not uniquely identify the observations
isid make
isid foreign make

8. levelsof:变量不同取值的数量

代码语言:js
复制
sysuse auto, clear

levelsof mpg
return list
dis r(r) //mpg有21个不同的取值

levelsof rep78
return list
dis r(r) //rep78有5个不同的取值

9. duplicates:处理重复的观测值

代码语言:txt
复制
sysuse auto, clear 
help duplicates
duplicates report price  
duplicates report rep78
duplicates example rep78
duplicates drop rep78, force

10. trim():消除字符形变量中的空格

代码语言:js
复制
clear all 

input str12 str 
"String A " 
" String B " 
" String C" 
end 

replace str = trim(str) 

11. 生成新变量的同时控制其位置

代码语言:js
复制
sysuse auto, clear
gen price_k = price/1000, before(price)     //price_k 位于 price 之前
*字符变量小写
gen make_lower = lower(make), after(make)   //make_lower 位于 make 之后

12. 提取字符型变量的元素

代码语言:js
复制
sysuse auto, clear 
gen first = substr(make, 1, 2), after(make) //提取 make 变量中的前 2 个字母

13. _n 和 _N

代码语言:js
复制
sysuse auto, clear
keep rep78
sort rep78

gen n = _n                   //观测值序号 

bys rep78: gen group_n = _n  //按组分割后的观测值序号

gen N = _N                   //观测值总数        
      
bys rep78: gen group_N = _N  //按组分割后的观测值总数

list in 1/15, sepby(rep78)
/*
     +-------------------------------------+
     | rep78    n   group_n    N   group_N |
     |-------------------------------------|
  1. |     1    1         1   74         2 |
  2. |     1    2         2   74         2 |
     |-------------------------------------|
  3. |     2    3         1   74         8 |
  4. |     2    4         2   74         8 |
  5. |     2    5         3   74         8 |
  6. |     2    6         4   74         8 |
  7. |     2    7         5   74         8 |
  8. |     2    8         6   74         8 |
  9. |     2    9         7   74         8 |
 10. |     2   10         8   74         8 |
     |-------------------------------------|
 11. |     3   11         1   74        30 |
 12. |     3   12         2   74        30 |
 13. |     3   13         3   74        30 |
 14. |     3   14         4   74        30 |
 15. |     3   15         5   74        30 |
     +-------------------------------------+ */

14. 使用循环批量修改变量名称

代码语言:js
复制
sysuse census, clear 

*全部变量
foreach var of varlist _all {     
        rename `var' `var'_new 
        }
        
*指定变量        
foreach var of varlist state state2 region pop { 
        rename `var' `var'_new 
        }

15. total():观测值总数

代码语言:js
复制
sysuse auto, clear  
gen one = 1, before(make)     //before() 选项在第10条中有介绍

egen total_all = total(one)   //样本观测值总数

bysort rep78: egen total_rep78 = total(one)  //按照rep78分类的子样本观测值数量

order total_all total_rep78 rep78
list total_all total_rep78 rep78 in 1/15, sepby(rep78)
/*
     +-----------------------------+
     | total_~l   total~78   rep78 |
     |-----------------------------|
  1. |       74          2       1 |
  2. |       74          2       1 |
     |-----------------------------|
  3. |       74          8       2 |
  4. |       74          8       2 |
  5. |       74          8       2 |
  6. |       74          8       2 |
  7. |       74          8       2 |
  8. |       74          8       2 |
  9. |       74          8       2 |
 10. |       74          8       2 |
     |-----------------------------|
 11. |       74         30       3 |
 12. |       74         30       3 |
 13. |       74         30       3 |
 14. |       74         30       3 |
 15. |       74         30       3 |
     +-----------------------------+ */

16. xtile:生成变量的分位数类别

代码语言:js
复制
sysuse auto, clear 
keep price

*四分位数类别
xtile price_quartile = price, nq(4) 

*十分位数类别
xtile price_decile   = price, nq(10)

*第 99th 分位数取值
egen high_price   = pctile(price), p(99)

*第 50th 分位数取值
egen median_price = pctile(price), p(50) 

list in 1/10, sep(0)
/*
     +----------------------------------------------------+
     |  price   pri~tile   pri~cile   high_p~e   median~e |
     |----------------------------------------------------|
  1. |  4,099          1          2      15906     5006.5 |
  2. |  4,749          2          5      15906     5006.5 |
  3. |  3,799          1          1      15906     5006.5 |
  4. |  4,816          2          5      15906     5006.5 |
  5. |  7,827          4          8      15906     5006.5 |
  6. |  5,788          3          7      15906     5006.5 |
  7. |  4,453          2          4      15906     5006.5 |
  8. |  5,189          3          6      15906     5006.5 |
  9. | 10,372          4          9      15906     5006.5 |
 10. |  4,082          1          2      15906     5006.5 |
     +----------------------------------------------------+  */

17. asgen:计算加权平均

代码语言:js
复制
capture ssc install asgen
sysuse census, clear
keeporder state region medage pop

bys region: asgen medage_m1 = medage       //均值
bys region: egen  medage_m2 = mean(medage) //与上方命令等价

bys region: asgen medage_w = medage, weight(pop) //加权均值
duplicates drop region, force

list region medage_m1 medage_m2 medage_w 
/*
     +--------------------------------------------+
     | region    medage_m1   medage~2    medage_w |
     |--------------------------------------------|
  1. | NE        31.233333   31.23333    31.82184 |
  2. | N Cntrl      29.525     29.525   29.634891 |
  3. | South      29.61875   29.61875   29.799904 |
  4. | West      28.284616   28.28462   29.348818 |
     +--------------------------------------------+ */

18. collapse:进行组内计算

代码语言:js
复制
sysuse census, clear
gen N = 1, before(region)
sort region
collapse (mean) medage (sum) N, by(region)
list
/*
     +-----------------------+
     | region    medage    N |
     |-----------------------|
  1. | NE         31.23    9 |
  2. | N Cntrl    29.52   12 |
  3. | South      29.62   16 |
  4. | West       28.28   13 |
     +-----------------------+ */

sysuse census, clear
collapse (mean) medage [aweight=pop], by(region)  //按 pop 变量加权
/*
     +------------------+
     | region    medage |
     |------------------|
  1. | NE         31.82 |
  2. | N Cntrl    29.63 |
  3. | South      29.80 |
  4. | West       29.35 |
     +------------------+ */

19. seq:生成重复的整数序列

代码语言:js
复制
capture ssc install seq
sysuse auto, clear
keep rep78
sort rep78

*repeat 1 2 3 4 5
seq rep1, from(1) to(5)

*repeat 1 1 2 2 3 3 4 4 5 5
seq rep2, from(1) to(5) block(2)

*repeat 1 2 3 ...
seq rep3, by(rep78)

list in 1/15, clean sepby(rep78)
/*
       rep78   rep1   rep2   rep3  
  1.       1      1      1      1  
  2.       1      2      1      2  
  3.       2      3      2      1  
  4.       2      4      2      2  
  5.       2      5      3      3  
  6.       2      1      3      4  
  7.       2      2      4      5  
  8.       2      3      4      6  
  9.       2      4      5      7  
 10.       2      5      5      8  
 11.       3      1      1      1  
 12.       3      2      1      2  
 13.       3      3      2      3  
 14.       3      4      2      4  
 15.       3      5      3      5   */

20. rowtotal:变量间求和

代码语言:js
复制
sysuse auto, clear
keep price mpg rep78

*只要有缺失值相加后仍为缺失值
gen sum1 = price + mpg + rep78

*rowtatal将缺失值视为 0
egen sum2 = rowtotal(price mpg rep78) 
egen sum3 = rowtotal(price-rep78)

list in 1/10, sep(0)
/*
     +----------------------------------------------+
     |  price   mpg   rep78    sum1    sum2    sum3 |
     |----------------------------------------------|
  1. |  4,099    22       3    4124    4124    4124 |
  2. |  4,749    17       3    4769    4769    4769 |
  3. |  3,799    22       .       .    3821    3821 |
  4. |  4,816    20       3    4839    4839    4839 |
  5. |  7,827    15       4    7846    7846    7846 |
  6. |  5,788    18       3    5809    5809    5809 |
  7. |  4,453    26       .       .    4479    4479 |
  8. |  5,189    20       3    5212    5212    5212 |
  9. | 10,372    16       3   10391   10391   10391 |
 10. |  4,082    19       3    4104    4104    4104 |
     +----------------------------------------------+ */

21. 生成随机数或随机变量

代码语言:js
复制
sysuse auto, clear
set seed 123

*服从均匀分布
gen r  = runiform(), before(make)        //0-1之间随机数
gen r1 = runiform(1,10), before(make)    //1-10之间随机数
gen r2 = runiformint(1,10), before(make) //1-10之间的整数随机数

*服从正态分布
gen rn  = rnormal(), before(make)         //标准正态分布(0, 1)
gen rn1 = rnormal(1), before(make)       //正态分布(1, 1)
gen rn2 = rnormal(5, 2), before(make)    //正态分布(5, 2)

*更多函数:Random-number functions

22. clonevar:生成完全一致的变量

代码语言:js
复制
sysuse auto2, clear
keep rep78

gen rep78_2 = rep78

clonevar rep78_3 = rep78

list in 1/10
/*
     +-----------------------------+
     |   rep78   rep78_2   rep78_3 |
     |-----------------------------|
  1. | Average         3   Average |
  2. | Average         3   Average |
  3. |       .         .         . |
  4. | Average         3   Average |
  5. |    Good         4      Good |
     |-----------------------------|
  6. | Average         3   Average |
  7. |       .         .         . |
  8. | Average         3   Average |
  9. | Average         3   Average |
 10. | Average         3   Average |
     +-----------------------------+ */
tab rep78,m
tab rep78_2, m
tab rep78_3, m

23. keeporder:一步实现 keep 和 order

代码语言:js
复制
 ssc install keeporder
 
 *keep + order
 sysuse auto, clear
 keep foreign rep78 make
 order foreign rep78 make
 
 *keeporder
 sysuse auto, clear
 keeporder foreign rep78 make

24. gsort:由大到小排序

代码语言:js
复制
*sort 将变量由小到大排序
sysuse auto, clear

*由小达大两者均可
sort  mpg
gsort mpg

**由大到小只能gsort
gsort -mpg

*应用:将变量按类别生成排名
 bys foreign (-mpg): gen n=_n //- invalid name (报错无法执行)

 gsort foreign -mpg      //先排序
 by foreign: gen n = _n  //再生成

*下面的方式也可等价实现
 gen mpg_rev = -mpg
 bys foreign (mpg_rev): gen n=_n
 drop mpg_rev

25. expand:按倍数复制观测行

代码语言:js
复制
sysuse auto, clear
keeporder make mpg 
expand 2  //按现有观测行的2倍复制观测行
sort make

26. 生成滞后(lag)与超前(lead)的变量

代码语言:js
复制
sysuse uslifeexp, clear
keeporder year le
gen lag_le =  le[_n-1], after(le)
gen lead_le = le[_n+1], after(lag_le)
/*
     +--------------------------------+
     | year     le   lag_le   lead_le |
     |--------------------------------|
  1. | 1900   47.3        .      49.1 |
  2. | 1901   49.1     47.3      51.5 |
  3. | 1902   51.5     49.1      50.5 |
  4. | 1903   50.5     51.5      47.6 |
  5. | 1904   47.6     50.5      48.7 |
  6. | 1905   48.7     47.6      48.7 |
  7. | 1906   48.7     48.7      47.6 |
  8. | 1907   47.6     48.7      51.1 |
  9. | 1908   51.1     47.6      52.1 |
 10. | 1909   52.1     51.1        50 |
 11. | 1910     50     52.1      52.6 |
 12. | 1911   52.6       50      53.5 |
 13. | 1912   53.5     52.6      52.5 |
 14. | 1913   52.5     53.5      54.2 |
 15. | 1914   54.2     52.5      54.5 |
     +--------------------------------+ */

27. recode:将连续变量生成新分类变量

代码语言:js
复制
webuse fullauto, clear
fre rep77 rep78 
recode rep77 rep78 (1 2 = 1 "Below average")   ///
                   (3   = 2 "Average")         ///
                   (4 5 = 3 "Above average"),  ///
                   pre(new) label(newrep)
fre newrep77 newrep78

label list repair newrep
/*
repair:
           1 Poor
           2 Fair
           3 Average
           4 Good
           5 Excellent
newrep:
           1 Below average
           2 Average
           3 Above average */

28. carryforward:向下填充

代码语言:js
复制
clear all

input id  time  y  
       1   1   1.2  
       1   3   2.4  
       1   4   3.4  
       1   7   3.2  
       1   9   2.4  
       2   3   1.8  
       2   4   5.6  
       2   6   4.3  
       3   2   2.3  
       3   4   4.5  
       3   7   6.7 
end 

list, sepby(id)
tsset id time  //设定时间序列数据
tsfill, full
list, sepby(id)

bysort id: carryforward y, gen(y1)  //直接向下填充

gsort id -time //以此实现 carry backforward
bysort id: carryforward y1, gen(y2) //反向调整后向下填充

gsort id time
/*
     +-----------------------------+
     | id   time     y    y1    y2 |
     |-----------------------------|
  1. |  1      1   1.2   1.2   1.2 |
  2. |  1      2     .   1.2   1.2 |
  3. |  1      3   2.4   2.4   2.4 |
  4. |  1      4   3.4   3.4   3.4 |
  5. |  1      5     .   3.4   3.4 |
  6. |  1      6     .   3.4   3.4 |
  7. |  1      7   3.2   3.2   3.2 |
  8. |  1      8     .   3.2   3.2 |
  9. |  1      9   2.4   2.4   2.4 |
     |-----------------------------|
 10. |  2      1     .     .   1.8 |
 11. |  2      2     .     .   1.8 |
 12. |  2      3   1.8   1.8   1.8 |
 13. |  2      4   5.6   5.6   5.6 |
 14. |  2      5     .   5.6   5.6 |
 15. |  2      6   4.3   4.3   4.3 |
 16. |  2      7     .   4.3   4.3 |
 17. |  2      8     .   4.3   4.3 |
 18. |  2      9     .   4.3   4.3 |
     |-----------------------------|
 19. |  3      1     .     .   2.3 |
 20. |  3      2   2.3   2.3   2.3 |
 21. |  3      3     .   2.3   2.3 |
 22. |  3      4   4.5   4.5   4.5 |
 23. |  3      5     .   4.5   4.5 |
 24. |  3      6     .   4.5   4.5 |
 25. |  3      7   6.7   6.7   6.7 |
 26. |  3      8     .   6.7   6.7 |
 27. |  3      9     .   6.7   6.7 |
     +-----------------------------+ */

...... To be revised continuously ......

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 1. browse if:精准浏览数据集
  • 2. fre :同时显示类别变量的标签及其取值
  • 3. mdesc:查看变量缺失值及其占比
  • 4. ereplace:一步实现变量数值替换
  • 5. distinct:查看变量非重复取值数量
  • 6. bysort var1(var2): keep if _n= 1:保留组内最大值
  • 7. isid:变量观测值是否唯一
  • 8. levelsof:变量不同取值的数量
  • 9. duplicates:处理重复的观测值
  • 10. trim():消除字符形变量中的空格
  • 11. 生成新变量的同时控制其位置
  • 12. 提取字符型变量的元素
  • 13. _n 和 _N
  • 14. 使用循环批量修改变量名称
  • 15. total():观测值总数
  • 16. xtile:生成变量的分位数类别
  • 17. asgen:计算加权平均
  • 18. collapse:进行组内计算
  • 19. seq:生成重复的整数序列
  • 20. rowtotal:变量间求和
  • 21. 生成随机数或随机变量
  • 22. clonevar:生成完全一致的变量
  • 23. keeporder:一步实现 keep 和 order
  • 24. gsort:由大到小排序
  • 25. expand:按倍数复制观测行
  • 26. 生成滞后(lag)与超前(lead)的变量
  • 27. recode:将连续变量生成新分类变量
  • 28. carryforward:向下填充
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档