MongoDB Series: Operations

I. Slow logs

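1 Find slow-log entries that took longer than N milliseconds (replace N with the threshold; the log reports durations in ms)

awk '$NF~/ms$/{print $1,$NF}' shard2.log|sed 's/ms$//'|awk -v n=N '$2 > n {print $1,$2}'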
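The same filter can be sketched in JavaScript, which is handy when the log lines are already loaded in a script. This is a minimal sketch, not the author's tooling: `slowEntries` and the sample lines are hypothetical, and it assumes the plain-text log format in which an operation line ends with `<n>ms`.

```javascript
// Return [timestamp, millis] pairs for log lines whose last field is "<n>ms"
// and whose duration exceeds thresholdMs (mirrors the awk/sed pipeline).
function slowEntries(lines, thresholdMs) {
  const result = [];
  for (const line of lines) {
    const fields = line.trim().split(/\s+/);
    const last = fields[fields.length - 1];
    const match = /^(\d+)ms$/.exec(last);
    if (!match) continue; // not an operation line
    const millis = Number(match[1]);
    if (millis > thresholdMs) result.push([fields[0], millis]);
  }
  return result;
}

// Hypothetical sample lines, made up for illustration only.
const sample = [
  '2021-10-22T11:01:45.976+0800 I COMMAND [conn12] command test.chenfeng command: find { find: "chenfeng" } 6123ms',
  '2021-10-22T11:01:46.100+0800 I COMMAND [conn13] command test.chenfeng command: find { find: "chenfeng" } 12ms',
  '2021-10-22T11:01:47.000+0800 I NETWORK [listener] connection accepted',
];
console.log(slowEntries(sample, 5000));
```

Only the first sample line passes a 5000 ms threshold; lines that do not end in a duration are skipped, just as the `$NF~/ms$/` pattern skips them.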

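2 Add an index online

nohup mongo --eval 'db.chenfeng.createIndex({"riqi":1},{background:true})' &

(createIndex replaces ensureIndex, which has been deprecated since MongoDB 3.0. Single quotes keep the shell from stripping the inner double quotes.)

3 Index build progress can be followed in the mongod log

4 Kill every operation that has been running for more than 5 seconds: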

db.currentOp().inprog.forEach(function(item){if(item.secs_running > 5 )db.killOp(item.opid)})
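The one-liner above kills everything over the limit, including internal operations. A slightly more cautious variant might look like the sketch below. `killLongRunning` is a hypothetical helper, and the two skip rules (inactive entries and the `local` database) are assumptions to adapt, not rules from the original notes.

```javascript
// Kill operations running longer than maxSecs and return the opids killed.
// `db` is the shell's database handle; only currentOp() and killOp() are used.
function killLongRunning(db, maxSecs) {
  const killed = [];
  db.currentOp().inprog.forEach(function (item) {
    // Assumption: leave idle entries and replication traffic on "local" alone.
    if (!item.active) return;
    if (String(item.ns || "").indexOf("local.") === 0) return;
    if (item.secs_running > maxSecs) {
      db.killOp(item.opid);
      killed.push(item.opid);
    }
  });
  return killed;
}
```

In the shell this would be called as `killLongRunning(db, 5)`; the returned opids can be logged so that the affected requests can be traced afterwards.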

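II. Overall performance analysis

mongostat -h ip:port -u <username> -p <password> --authenticationDatabase=admin --discover

1 insert/update/delete/query: show which type of operation is driving the load

2 dirty: percentage of the WiredTiger cache holding dirty bytes, measured against the configured maximum cache size (not against res)

used: percentage of the WiredTiger cache in use, measured against the configured maximum cache size

1 eviction_target (80%: background eviction threads start evicting), eviction_trigger (95%: application threads start evicting), eviction_dirty_target (5%: background eviction threads start flushing dirty pages), eviction_dirty_trigger (20%: application threads start flushing dirty pages)

2 db.adminCommand( { setParameter : 1, "wiredTigerEngineRuntimeConfig" : "eviction=(threads_min=4, threads_max=20)"}) Both values default to 4; the maximum can be raised

3 qrw arw: queued readers|writers (clients waiting) and active readers|writers (clients executing)

4 vsize res: virtual memory requested and resident memory actually in use; these are the main metrics to watch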
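To read the used and dirty columns against the eviction thresholds listed above, a small classifier can help. This is a sketch only: `evictionState` and the result field names are hypothetical, and the percentages are the values quoted in these notes, which a deployment may have overridden.

```javascript
// Thresholds as listed above, in percent of the configured cache size.
const EVICTION = { target: 80, trigger: 95, dirtyTarget: 5, dirtyTrigger: 20 };

// Report which eviction mechanisms should be active for the given
// mongostat "used" and "dirty" percentages.
function evictionState(usedPct, dirtyPct) {
  return {
    backgroundEvict: usedPct >= EVICTION.target,
    appThreadsEvict: usedPct >= EVICTION.trigger,
    backgroundFlushDirty: dirtyPct >= EVICTION.dirtyTarget,
    appThreadsFlushDirty: dirtyPct >= EVICTION.dirtyTrigger,
  };
}

console.log(evictionState(83.5, 6.2));
```

When either of the `appThreads*` flags is true, user requests are paying for eviction themselves, which is usually the point at which latency becomes visible.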

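mongotop -h ip:port -u <username> -p <password> --authenticationDatabase=admin

(--discover is a mongostat option and is not accepted by mongotop.)

1 ns/db: the namespace (database.collection) being reported

2 total: total time mongod spent operating on this namespace

3 read/write: time mongod spent on read / write operations on this namespace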

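III. Replica sets

1 db.printSlaveReplicationInfo() // check replication lag (superseded by rs.printSecondaryReplicationInfo() in newer releases)

2 Trigger a failover manually

1 rs.conf() to look up each member's _id and priority

2 cfg = rs.conf()

3 cfg.members[i].priority = n (i is the member's position in the members array, which is not necessarily its _id)

4 rs.reconfig(cfg)

5 rs.status() to confirm which member is now primary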
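For the lag check in item 1, the same figure can be derived from the rs.status() document, which is convenient in monitoring scripts. The sketch below assumes each member entry carries `name`, `stateStr` and `optimeDate`; `replicationLagSeconds` is a hypothetical helper name.

```javascript
// Compute per-secondary lag in seconds from an rs.status() document.
// Returns null when no member currently reports itself as PRIMARY.
function replicationLagSeconds(status) {
  const primary = status.members.find(function (m) {
    return m.stateStr === "PRIMARY";
  });
  if (!primary) return null;
  return status.members
    .filter(function (m) { return m.stateStr === "SECONDARY"; })
    .map(function (m) {
      // Subtracting two Date values yields milliseconds.
      return { name: m.name, lagSecs: (primary.optimeDate - m.optimeDate) / 1000 };
    });
}
```

In the shell this would be called as `replicationLagSeconds(rs.status())`. The value is the difference between the last applied operations, so it can briefly read as non-zero on an idle set.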

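IV. Reclaiming disk space

collection:

remove() does not return disk space to the operating system, but MongoDB can reuse the freed space

drop() returns disk space by deleting the physical files

compact

Reclaims disk space (defragmentation)

Before 3.4: db.runCommand({repairDatabase :1}) blocks globally and needs free space equal to the size of the existing data

3.4 and later: db.tablename.runCommand("compact") or db.runCommand({compact:"tablename",force:true}) blocks reads and writes at the database level and needs free space equal to the size of the existing data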
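Before paying for the blocking described above, it can be worth estimating how much space compact would actually give back. The sketch below reads the WiredTiger block-manager counter from a collection stats document taken directly on a mongod; `compactEstimate` is a hypothetical helper, and it assumes the `file bytes available for reuse` counter is present in the stats output.

```javascript
// Estimate reclaimable space from a db.<collection>.stats() document.
function compactEstimate(stats) {
  const blockManager = (stats.wiredTiger || {})["block-manager"] || {};
  const reusable = blockManager["file bytes available for reuse"] || 0;
  return {
    storageSize: stats.storageSize,
    reusableBytes: reusable,
    // Share of the on-disk file that is free but not yet returned to the OS.
    reusablePct: stats.storageSize > 0 ? (100 * reusable) / stats.storageSize : 0,
  };
}
```

In the shell this would be called as `compactEstimate(db.tablename.stats())`. A low `reusablePct` suggests that compact is unlikely to be worth the interruption.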

V. Inspecting sessions online

1 Filter operations that have been running for more than N seconds

db.currentOp({"active" : true, "secs_running" : { "$gt" : N } })

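Key fields in the output

client	The client that issued the request.
opid	The operation id; if necessary, the operation can be terminated directly with db.killOp(opid).
secs_running/microsecs_running	The field to watch most closely: how long the request has been running. If the value is very large, check whether the request is reasonable.
query/ns	Shows which collection is being operated on and what the operation is.
lock*	Lock-related fields. They are not covered here; see the official db.currentOp documentation for details.

2 Kill a session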

db.killOp(opid)

Returns { "info" : "attempting to kill op", "ok" : 1 }

3 Find in-progress index builds

db.currentOp( { $or: [ { op: "query", "query.createIndexes": { $exists: true } }, { op: "insert", ns: /\.system\.indexes\b/ } ] } )

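VI. Removing a shard

After removeShard is run, chunks of sharded collections are migrated off the shard automatically; databases that are not sharded must be moved manually with movePrimary. Both commands are run against the admin database on a mongos.

1 db.runCommand( { removeShard: "<shard name>" } )

1 In the removeShard response, state "ongoing" means chunks are still being moved, and the remaining field shows how many chunks are left to move

2 The note "you need to drop or movePrimary these databases" lists the databases that have to be moved manually

1 db.runCommand( { movePrimary: "<unsharded database name>", to: "<target shard name>" })

Returns { "primary" : "mongodb1", "ok" : 1 }

3 Final response { msg: "remove shard completed succesfully" , stage: "completed", host: "mongodb0", ok : 1 }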
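Because removeShard has to be re-run until draining finishes, the response handling above can be wrapped in a small decision function. This is a sketch under assumptions: `nextRemoveShardStep` is a hypothetical name, and the `dbsToMove` array is taken from newer server responses rather than from these notes, so the code tolerates its absence.

```javascript
// Decide what to do next from one removeShard response document.
function nextRemoveShardStep(res) {
  if (res.state === "completed") {
    return { done: true, remainingChunks: 0, movePrimary: [] };
  }
  return {
    done: false,
    // Chunks still to be migrated off the shard, if the server reported them.
    remainingChunks: res.remaining ? res.remaining.chunks : undefined,
    // Assumption: newer servers list unsharded databases in dbsToMove.
    movePrimary: res.dbsToMove || [],
  };
}
```

Each name in `movePrimary` needs its own movePrimary command before the shard can finish draining.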

2 Force every mongos to refresh

Log in to each mongos router and run db.adminCommand({"flushRouterConfig":1}) to force a refresh of the routing information.

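VII. Backup and restore

1 mongodump -h ip:port -uroot -proot --authenticationDatabase admin -d dbname -c collection_name -o backup_dir --gzip

1 The local database is not dumped by default; admin is dumped (it holds the user and role definitions). Index definitions are written to each collection's metadata.json file

2 With --oplog, an oplog.bson file is written as well (full-instance dumps only, so it cannot be combined with -d or -c)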

2 mongorestore -h ip:port -uroot -proot --authenticationDatabase admin -d dbname backup_dir --gzip
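When these commands are generated from a script, the option rules above can be encoded once. The sketch below builds a mongodump argument list using only the flags that appear in these notes; `mongodumpArgs` and its option names are hypothetical.

```javascript
// Build a mongodump argument list from the flags used in these notes.
function mongodumpArgs(opts) {
  // --oplog is only supported for full-instance dumps.
  if (opts.oplog && (opts.db || opts.collection)) {
    throw new Error("--oplog cannot be combined with -d or -c");
  }
  const args = ["-h", opts.host, "-u", opts.user, "-p", opts.password,
                "--authenticationDatabase", "admin"];
  if (opts.db) args.push("-d", opts.db);
  if (opts.collection) args.push("-c", opts.collection);
  if (opts.oplog) args.push("--oplog");
  args.push("-o", opts.outDir, "--gzip");
  return args;
}

console.log(mongodumpArgs({
  host: "ip:port", user: "root", password: "root",
  db: "dbname", collection: "collection_name", outDir: "backup_dir",
}).join(" "));
```

Returning an array rather than a single string keeps the arguments safe to pass to a process launcher without shell quoting.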

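VIII. Forcing a primary election on a shard

1 cfg = rs.conf()

2 cfg.members[i].priority = n (i is the member's position in the members array, which is not necessarily its _id)

3 rs.reconfig(cfg)

Adjusting priority manually forces a chosen member to become primary: the member with the highest priority is elected primary, while a member with priority 0 can never become primary but can still vote.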
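A common slip in the steps above is indexing cfg.members by _id when the array position differs. The sketch below looks the member up by _id and returns a modified copy of the configuration; `withPriority` is a hypothetical helper, not a shell built-in.

```javascript
// Return a copy of a replica set config with one member's priority changed.
// The member is located by its _id, not by its position in the array.
function withPriority(cfg, memberId, priority) {
  const index = cfg.members.findIndex(function (m) { return m._id === memberId; });
  if (index === -1) throw new Error("no member with _id " + memberId);
  const members = cfg.members.map(function (m, i) {
    return i === index ? Object.assign({}, m, { priority: priority }) : m;
  });
  return Object.assign({}, cfg, { members: members });
}
```

In the shell this would be used as `rs.reconfig(withPriority(rs.conf(), 2, 10))`, where the member _id 2 and priority 10 are placeholder values.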

IX. Common troubleshooting techniques

1 List all active sessions with client information

db.currentOp({"active" : true, "secs_running" : { "$gt" : N }}).inprog.forEach(function(item){print(item.client,item.opid,item.ns,JSON.stringify(item.query))});
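For a longer session list it helps to sort by running time and to cope with both field names for the operation body. The sketch below is one way to do that: `summarizeOps` is a hypothetical helper, and it assumes that newer servers report the body under `command` while older ones use `query`.

```javascript
// Produce one summary line per active operation running longer than minSecs,
// longest-running first.
function summarizeOps(inprog, minSecs) {
  return inprog
    .filter(function (op) { return op.active && op.secs_running > minSecs; })
    .sort(function (a, b) { return b.secs_running - a.secs_running; })
    .map(function (op) {
      // Assumption: "command" on newer servers, "query" on older ones.
      const body = JSON.stringify(op.command || op.query);
      return [op.client, op.opid, op.ns, op.secs_running + "s", body].join(" ");
    });
}
```

In the shell this would be called as `summarizeOps(db.currentOp().inprog, N).forEach(function (line) { print(line); })`.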

2 Bulk-kill sessions: suitable for emergencies, but consider how the application will retry afterwards

db.currentOp().inprog.forEach(function(item){if(item.secs_running > 1000 )db.killOp(item.opid)})

3 Query slow operations in a given time window (make sure profiling is enabled so that the problem statements have been recorded)

db.getProfilingLevel(); db.setProfilingLevel(1, N) // N is the slow-operation threshold in milliseconds

db.system.profile.find({ts : {$gt : new ISODate("2021-10-22T11:01:45.976Z"),$lt : new ISODate("2021-10-22T14:01:48.976Z")}})
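The time-window query is easy to get wrong by hand, for example by leaving the timestamps unquoted. A small builder can validate them first. This is a sketch: `profileWindowFilter` is a hypothetical helper, and the optional `millis` condition assumes the profiler's usual duration field.

```javascript
// Build a system.profile filter for a time window, optionally restricted
// to operations slower than minMillis.
function profileWindowFilter(startIso, endIso, minMillis) {
  const start = new Date(startIso);
  const end = new Date(endIso);
  if (Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    throw new Error("timestamps must be valid ISO-8601 strings");
  }
  if (start >= end) throw new Error("start must be earlier than end");
  const filter = { ts: { $gt: start, $lt: end } };
  if (minMillis !== undefined) filter.millis = { $gt: minMillis };
  return filter;
}
```

In the shell this would be used as `db.system.profile.find(profileWindowFilter("2021-10-22T11:01:45.976Z", "2021-10-22T14:01:48.976Z", 100))`, where 100 is a placeholder threshold.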

4 Check whether a collection is sharded

db.collection_name.stats().sharded