客服接到客户投诉,为毛直播看不了?为毛一直在卡?比较常见的答案是:农民工又挖断光纤了,黑客又攻击系统了。有毛线办法?看过FMS日志的都知道,里面没有任何有用的信息。
2012年我在某CDN时,就下决心再也不给农民工朋友摸黑,也不动用黑客攻击的大名。光纤挖断是有的,黑客攻击也是有的,问题是光纤没有断,黑客没有光顾时,卡和流不能看的概率也很大,服务器日志能给予原因吗?不能,这就是为何FMS日志,或其他不能告知原因的日志,球一样的原因了。
CDN会用到多台服务器,或者说是服务器集群,再加上调度那就是多个集群之间的调用和依赖运行。SRS是流媒体服务器,因此我只考虑流媒体系统的日志应该怎么设计,才能发现问题所在?
先考虑常见系统的日志,FMS的日志是这样的:
2015-03-24 12:23:58 3409 (s)2641173 Accepted a connection from IP:192.168.1.141, referrer: http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23, pageurl: http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935 -
702111234525315439 3130 3448 normal livestream - - rtmp://192.168.1.185:1935/live/livestream rtmp://192.168.1.185:1935/live/livestream - flv - - 0 - 0 0 - - http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935 -1 -1.000000 -
crtmpd的日志确实很详细,可惜啥也看不粗来,特别是几千人在线的服务器:
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/iohandlermanager.cpp:120 Handlers count changed: 15->16 IOHT_TCP_CARRIER
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/tcpacceptor.cpp:185 Client connected: 192.168.1.141:54823 -> 192.168.1.173:1935
/home/winlin/tools/crtmpserver.20130514.794/sources/applications/appselector/src/rtmpappprotocolhandler.cpp:83 Selected application: flvplayback (live)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:246 Protocol CTCP(17) <-> TCP(18) <-> [IR(19)] unregistered from application: appselector
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257 Stream NR(5) with name `` registered to application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268 Stream NR(5) with name `` unregistered from application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257 Stream NR(6) with name `` registered to application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/protocols/rtmp/basertmpappprotocolhandler.cpp:1043 Play request for stream name `livestream`. Start: -2000; length: -1000
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268 Stream NR(6) with name `` unregistered from application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:257 Stream ONR4R(7) with name `livestream` registered to application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/common/src/utils/buffering/iobuffer.cpp:110 Unable to read data. Size advertised by network layer was 262144. Permanent error: 2
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/tcpcarrier.cpp:78 Unable to read data. 192.168.1.141:54823 -> 192.168.1.173:1935 CTCP(17) <-> [TCP(18)] <-> IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/netio/epoll/iohandlermanager.cpp:130 Handlers count changed: 16->15 IOHT_TCP_CARRIER
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/protocols/protocolmanager.cpp:45 Enqueue for delete for protocol [IR(19)]
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:268 Stream ONR4R(7) with name `livestream` unregistered from application `flvplayback` from protocol IR(19)
/home/winlin/tools/crtmpserver.20130514.794/sources/thelib/src/application/baseclientapplication.cpp:246 Protocol [IR(19)] unregistered from application: flvplayback
nginx-rtmp的日志已经有进步了,能按照连接区分:
2015/03/24 11:42:01 [info] 7992#0: *3 client connected '192.168.1.141'
2015/03/24 11:42:01 [info] 7992#0: *3 connect: app='live' args='' flashver='MAC 17,0,0,134' swf_url='http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23' tc_url='rtmp://192.168.1.173:1935/live' page_url='http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935' acodecs=3575 vcodecs=252 object_encoding=3, client: 192.168.1.141, server: 0.0.0.0:1935
2015/03/24 11:42:01 [info] 7992#0: *3 createStream, client: 192.168.1.141, server: 0.0.0.0:1935
2015/03/24 11:42:01 [info] 7992#0: *3 play: name='livestream' args='' start=0 duration=0 reset=0 silent=0, client: 192.168.1.141, server: 0.0.0.0:1935
2015/03/24 11:42:20 [info] 7992#0: *3 disconnect, client: 192.168.1.141, server: 0.0.0.0:1935
2015/03/24 11:42:20 [info] 7992#0: *3 deleteStream, client: 192.168.1.141, server: 0.0.0.0:1935
可惜nginx日志只能知道这个连接的信息,而上下文都不清楚,连接也就这些信息。而SRS的日志,能知道播放连接对应的推流连接,会在后续打印出连接的摘要信息,并且可以在集群内任意追溯。
假设我知道流名称,或者不知道流名称,反正任意信息,譬如我知道播放的链接会打一个"type=Play"的标记出来,就从这一点开始。假设从源站开始:
[winlin@dev6 srs]$ grep -ina "type=Play" objs/srs.origin.log
31:[2014-08-06 10:09:34.671][trace][22288][107] client identified, type=Play, stream_name=livestream, duration=-1.00
发现有个107的链接播放了源站信息,查看它的日志:
[winlin@dev6 srs]$ grep -ina "\[107\]" objs/srs.origin.log
27:[2014-08-06 10:09:34.589][trace][22288][107] RTMP client ip=127.0.0.1
28:[2014-08-06 10:09:34.591][trace][22288][107] complex handshake success
29:[2014-08-06 10:09:34.631][trace][22288][107] connect app, tcUrl=rtmp://dev:1935/live, pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935, swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23, schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=(obj)
30:[2014-08-06 10:09:34.631][trace][22288][107] edge-srs ip=192.168.1.159, version=0.9.190, pid=22314, id=108
31:[2014-08-06 10:09:34.671][trace][22288][107] client identified, type=Play, stream_name=livestream, duration=-1.00
32:[2014-08-06 10:09:34.671][trace][22288][107] out chunk size to 60000
33:[2014-08-06 10:09:34.671][trace][22288][107] source url=__defaultVhost__/live/livestream, ip=127.0.0.1, cache=1, is_edge=0, source_id=105[105]
34:[2014-08-06 10:09:34.672][trace][22288][107] dispatch cached gop success. count=307, duration=4515
35:[2014-08-06 10:09:34.672][trace][22288][107] create consumer, queue_size=30.00, tba=44100, tbv=25
36:[2014-08-06 10:09:34.672][trace][22288][107] ignored. set buffer length to 1000
37:[2014-08-06 10:09:34.673][trace][22288][107] input chunk size to 60000
40:[2014-08-06 10:09:44.748][trace][22288][107] -> PLA time=10007, msgs=0, okbps=464,0,0, ikbps=3,0,0
41:[2014-08-06 10:09:47.805][warn][22288][107][104] client disconnect peer. ret=1004
可以看到源id是105,关键字是source_id=105
,查这个源:
[winlin@dev6 srs]$ grep --color -ina "\[105\]" objs/srs.origin.log
16:[2014-08-06 10:09:30.331][trace][22288][105] RTMP client ip=127.0.0.1
17:[2014-08-06 10:09:30.331][trace][22288][105] srand initialized the random.
18:[2014-08-06 10:09:30.332][trace][22288][105] simple handshake success.
19:[2014-08-06 10:09:30.373][trace][22288][105] connect app, tcUrl=rtmp://127.0.0.1:1936/live?vhost=__defaultVhost__, pageUrl=, swfUrl=, schema=rtmp, vhost=__defaultVhost__, port=1936, app=live, args=null
21:[2014-08-06 10:09:30.417][trace][22288][105] client identified, type=publish(FMLEPublish), stream_name=livestream, duration=-1.00
22:[2014-08-06 10:09:30.417][trace][22288][105] out chunk size to 60000
23:[2014-08-06 10:09:30.418][trace][22288][105] source url=__defaultVhost__/live/livestream, ip=127.0.0.1, cache=1, is_edge=0, source_id=-1[-1]
24:[2014-08-06 10:09:30.466][trace][22288][105] got metadata, width=768, height=320, vcodec=7, acodec=10
25:[2014-08-06 10:09:30.466][trace][22288][105] 46B video sh, codec(7, profile=100, level=32, 0x0, 0kbps, 0fps, 0s)
26:[2014-08-06 10:09:30.466][trace][22288][105] 4B audio sh, codec(10, profile=1, 2channels, 0kbps, 44100HZ), flv(16bits, 2channels, 44100HZ)
33:[2014-08-06 10:09:34.671][trace][22288][107] source url=__defaultVhost__/live/livestream, ip=127.0.0.1, cache=1, is_edge=0, source_id=105[105]
38:[2014-08-06 10:09:40.732][trace][22288][105] <- CPB time=10100, okbps=3,0,0, ikbps=332,0,0
可见这个就是ingest的连接,即编码器推流连接。已经查到了源头。
同时可以看到107这个其实是srs的回源链接,关键字是edge-srs
:
30:[2014-08-06 10:09:34.631][trace][22288][107] edge-srs ip=192.168.1.159, version=0.9.190, pid=22314, id=108
可以去边缘服务器上查它的信息,id是108:
[winlin@dev6 srs]$ grep --color -ina "\[108\]" objs/srs.log
29:[2014-08-06 10:09:34.579][trace][22314][108] edge pull connected, can_publish=1, url=rtmp://dev:1935/live/livestream, server=127.0.0.1:1936
30:[2014-08-06 10:09:34.591][trace][22314][108] complex handshake success.
31:[2014-08-06 10:09:34.671][trace][22314][108] connected, version=0.9.190, ip=127.0.0.1, pid=22288, id=107
32:[2014-08-06 10:09:34.672][trace][22314][108] out chunk size to 60000
33:[2014-08-06 10:09:34.672][trace][22314][108] ignore the disabled transcode:
34:[2014-08-06 10:09:34.672][trace][22314][108] edge change from 100 to state 101 (pull).
35:[2014-08-06 10:09:34.672][trace][22314][108] input chunk size to 60000
36:[2014-08-06 10:09:34.672][trace][22314][108] got metadata, width=768, height=320, vcodec=7, acodec=10
37:[2014-08-06 10:09:34.672][trace][22314][108] 46B video sh, codec(7, profile=100, level=32, 0x0, 0kbps, 0fps, 0s)
38:[2014-08-06 10:09:34.672][trace][22314][108] 4B audio sh, codec(10, profile=1, 2channels, 0kbps, 44100HZ), flv(16bits, 2channels, 44100HZ)
39:[2014-08-06 10:09:34.779][trace][22314][107] update source_id=108[108]
46:[2014-08-06 10:09:36.853][trace][22314][110] source url=__defaultVhost__/live/livestream, ip=192.168.1.179, cache=1, is_edge=1, source_id=108[108]
50:[2014-08-06 10:09:44.949][trace][22314][108] <- EIG time=10293, okbps=3,0,0, ikbps=441,0,0
53:[2014-08-06 10:09:47.805][warn][22314][108][4] origin disconnected, retry. ret=1007
这个边缘服务器上这个回源链接有两个客户端连接上,107和110,关键字是source_id=108
:
[winlin@dev6 srs]$ grep --color -ina "\[107\]" objs/srs.log
18:[2014-08-06 10:09:34.281][trace][22314][107] RTMP client ip=192.168.1.179
19:[2014-08-06 10:09:34.282][trace][22314][107] srand initialized the random.
20:[2014-08-06 10:09:34.291][trace][22314][107] complex handshake success
21:[2014-08-06 10:09:34.291][trace][22314][107] connect app, tcUrl=rtmp://dev:1935/live, pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935, swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23, schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=null
22:[2014-08-06 10:09:34.532][trace][22314][107] ignored. set buffer length to 800
23:[2014-08-06 10:09:34.568][trace][22314][107] client identified, type=Play, stream_name=livestream, duration=-1.00
24:[2014-08-06 10:09:34.568][trace][22314][107] out chunk size to 60000
25:[2014-08-06 10:09:34.568][trace][22314][107] source url=__defaultVhost__/live/livestream, ip=192.168.1.179, cache=1, is_edge=1, source_id=-1[-1]
26:[2014-08-06 10:09:34.579][trace][22314][107] dispatch cached gop success. count=0, duration=0
27:[2014-08-06 10:09:34.579][trace][22314][107] create consumer, queue_size=30.00, tba=0, tbv=0
28:[2014-08-06 10:09:34.579][trace][22314][107] ignored. set buffer length to 800
39:[2014-08-06 10:09:34.779][trace][22314][107] update source_id=108[108]
54:[2014-08-06 10:09:47.805][trace][22314][107] cleanup when unpublish
55:[2014-08-06 10:09:47.805][trace][22314][107] edge change from 101 to state 0 (init).
56:[2014-08-06 10:09:47.805][warn][22314][107][9] client disconnect peer. ret=1004
107是触发回源的连接。查看110这个链接:
[winlin@dev6 srs]$ grep --color -ina "\[110\]" objs/srs.log
40:[2014-08-06 10:09:36.609][trace][22314][110] RTMP client ip=192.168.1.179
41:[2014-08-06 10:09:36.613][trace][22314][110] complex handshake success
42:[2014-08-06 10:09:36.613][trace][22314][110] connect app, tcUrl=rtmp://dev:1935/live, pageUrl=http://www.ossrs.net/players/srs_player.html?vhost=dev&stream=livestream&server=dev&port=1935, swfUrl=http://www.ossrs.net/players/srs_player/release/srs_player.swf?_version=1.23, schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, args=null
43:[2014-08-06 10:09:36.835][trace][22314][110] ignored. set buffer length to 800
44:[2014-08-06 10:09:36.853][trace][22314][110] client identified, type=Play, stream_name=livestream, duration=-1.00
45:[2014-08-06 10:09:36.853][trace][22314][110] out chunk size to 60000
46:[2014-08-06 10:09:36.853][trace][22314][110] source url=__defaultVhost__/live/livestream, ip=192.168.1.179, cache=1, is_edge=1, source_id=108[108]
47:[2014-08-06 10:09:36.853][trace][22314][110] dispatch cached gop success. count=95, duration=1573
48:[2014-08-06 10:09:36.853][trace][22314][110] create consumer, queue_size=30.00, tba=44100, tbv=25
49:[2014-08-06 10:09:36.853][trace][22314][110] ignored. set buffer length to 800
51:[2014-08-06 10:09:45.919][trace][22314][110] -> PLA time=8759, msgs=21, okbps=461,0,0, ikbps=3,0,0
52:[2014-08-06 10:09:46.247][warn][22314][110][104] client disconnect peer. ret=1004
可见110也是个flash播放连接。
解决了排错的问题,如何解决卡的问题?用测速了,SRS可以支持linux客户端和flash客户端直接对SRS测速,测试到SRS服务器的上行和下行速度。也就是说,页面推流前,就可以调用js接口来获取到服务器的速度,如果知道服务器地址,还可以自己选择服务器。
配合测速和日志系统,就能知道卡或者故障的用户分发的路径是什么,运维或者高级运维系统可以自己排插这个路径的问题所在。从这以后,挖断光纤的概率小很多了,而黑客攻击也只有真正被攻击时(这时候要用热备)才是公鸡了。
集成了测速客户端的客户,自己就可以排除推流端的带宽过低的故障,这种问题大约占了90%的故障。而且更加致命的是,没有办法证明,除非能测速。泥马其他测速是测试到其他服务器的速度,譬如360测速,那顶多证明用户到其他服务器的下行没有问题,和流媒体服务商的服务器有球关系?而且还是上行!
关于测速,参考:https://github.com/winlinvip/simple-rtmp-server/wiki/v1_CN_BandwidthTestTool
关于SRS可追溯日志,参考:https://github.com/winlinvip/simple-rtmp-server/wiki/v1_CN_SrsLog#tracable-log