阿里云服务器降配引起的服务异常

原本一个可以睡足懒觉的美好周日,却因为自己之前的马虎导致服务异常而被微信吵醒/(ㄒoㄒ)/~~
PS:本事故与阿里云无关!

故事是从一个星期之前,测试服务器的服务迁移开始的。因为公司业务调整,各个业务的相关测试服务也进行了必要的调整,服务器资源配置有的也进行了更新。
有台原来提供MySQL服务的测试服务器,用的是阿里云服务器(ECS),配置由原来的两核处理器4G内存降配为单核2G,MySQL的InnoDB引擎的缓存池大小innodb_buffer_pool_size也由3G降至1.5G,服务器降配后重启后MySQL服务正常。

然而,就在我周日想好好睡个懒觉的时候,一个早起在家加班的前端哥们儿炸了,为毛测试的Web服务不能用了,所有接口统统没有响应了。PS:我特别佩服早起撸码的同仁(`・ω・´)

我裤子都来不及提,赶紧先打开家里的老爷机,然后才开始穿衣服。原程连上服务器后,发现那个测试的Web服务居然在那台我操作过的测试数据库服务器。而我并不知道它是什么时候装进来的,先不纠结这个问题。。。

看Tomcat

我先偷偷看看Tomcat还在不在挂没挂

1
2
3
4
5
6
7
8
[bocai@iZ2 ~]$ ps -ef|grep java
500 13337 1 0 Feb04 ? 00:02:56 /usr/local/jre-7u80/bin/java
-Djava.util.logging.config.file=/usr/local/apache-tomcat-7/conf/logging.properties
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048
-Djava.endorsed.dirs=/usr/local/apache-tomcat-7/endorsed
-classpath /usr/local/apache-tomcat-7/bin/bootstrap.jar:/usr/local/apache-tomcat-7/bin/tomcat-juli.jar
-Dcatalina.base=/usr/local/apache-tomcat-7 -Dcatalina.home=/usr/local/apache-tomcat-saas
-Djava.io.tmpdir=/usr/local/apache-tomcat-7/temp org.apache.catalina.startup.Bootstrap start

Tomcat进程还活着,那就看看它的日志,看看都有啥。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
[bocai@iZ2 ~]$ tail -f /usr/local/apache-tomcat-7/logs/catalina.out -n 100
2018-02-04 08:00:18,743 ERROR [com.alibaba.druid.pool.DruidDataSource] create connection error
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Could not create connection to database server. Attempted reconnect 3 times. Giving up.
at sun.reflect.GeneratedConstructorAccessor141.newInstance(Unknown Source) ~[na:na]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.7.0_80]
at java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:1.7.0_80]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) ~[mysql-connector-java-5.1.15.jar:na]
at com.mysql.jdbc.Util.getInstance(Util.java:382) ~[mysql-connector-java-5.1.15.jar:na]
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1013) ~[mysql-connector-java-5.1.15.jar:na]
......
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1355) ~[druid-1.0.0.jar:1.0.0]
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1409) ~[druid-1.0.0.jar:1.0.0]
at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:1649) ~[druid-1.0.0.jar:1.0.0]
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.GeneratedConstructorAccessor133.newInstance(Unknown Source) ~[na:na]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:1.7.0_80]
at java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:1.7.0_80]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) ~[mysql-connector-java-5.1.15.jar:na]
......
... 18 common frames omitted
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.7.0_80]
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[na:1.7.0_80]
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[na:1.7.0_80]
at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[na:1.7.0_80]
......
... 20 common frames omitted

看完日志,发现数据库连接池连接MySQL出现问题,可能是MySQL挂掉了,赶紧的!

探MySQL

我还是先看看进程吧,万一MySQL还没有挂呢。

1
2
[bocai@iZ2 ~]$ ps -ef|grep mysql
500 13892 13812 0 11:08 pts/0 00:00:00 grep mysql

然后,MySQL服务已经挂掉了,那再看看错误日志。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
root@iZubZ: ~ # tail -f /data/mysql/bocai.err -n 100
180204 02:41:25 mysqld_safe Number of processes running now: 0
180204 02:41:25 mysqld_safe mysqld restarted
2018-02-04 02:41:26 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2018-02-04 02:41:26 0 [Note] /usr/sbin/mysqld (mysqld 5.6.25-enterprise-commercial-advanced-log) starting as process 12344 ...
2018-02-04 02:41:26 12344 [Note] Plugin 'FEDERATED' is disabled.
2018-02-04 02:41:27 12344 [Note] InnoDB: Using atomics to ref count buffer pool pages
2018-02-04 02:41:27 12344 [Note] InnoDB: The InnoDB memory heap is disabled
2018-02-04 02:41:27 12344 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-02-04 02:41:27 12344 [Note] InnoDB: Memory barrier is not used
2018-02-04 02:41:27 12344 [Note] InnoDB: Compressed tables use zlib 1.2.3
2018-02-04 02:41:27 12344 [Note] InnoDB: Using Linux native AIO
2018-02-04 02:41:27 12344 [Note] InnoDB: Using CPU crc32 instructions
2018-02-04 02:41:27 12344 [Note] InnoDB: Initializing buffer pool, size = 1.5G
InnoDB: mmap(206045184 bytes) failed; errno 12
2018-02-04 02:41:27 12344 [ERROR] InnoDB: Cannot allocate memory for the buffer pool
2018-02-04 02:41:27 12344 [ERROR] Plugin 'InnoDB' init function returned error.
2018-02-04 02:41:27 12344 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2018-02-04 02:41:27 12344 [ERROR] Unknown/unsupported storage engine: InnoDB
2018-02-04 02:41:27 12344 [ERROR] Aborting
2018-02-04 02:41:27 12344 [Note] Binlog end
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'partition'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'BLACKHOLE'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'ARCHIVE'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'PERFORMANCE_SCHEMA'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_DATAFILES'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_TABLESPACES'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN_COLS'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_FIELDS'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_COLUMNS'
2018-02-04 02:41:27 12344 [Note] Shutting down plugin 'INNODB_SYS_INDEXES'
2018-02-04 02:41:27 12344 [Note] Shutting down ......
2018-02-04 02:41:27 12344 [Note] /usr/sbin/mysqld: Shutdown complete

找到原因

从日志里面InnoDB: Cannot allocate memory for the buffer pool,我猜测应该是服务器内存不够了,看看服务器内存和MySQL配置

1
2
3
4
5
6
7
8
9
10
root@iZubZ: ~ # free -m
total used free shared buffers cached
Mem: 1875 605 1270 0 14 41
-/+ buffers/cache: 1214 661
Swap: 0 0 0
root@iZubZ: ~ # vim /etc/my.cnf
......
[mysqld]
innodb_buffer_pool_size=1536M
......

当前服务器剩余内存1270M,而MySQL配置的InnoDB缓存池大小为1.5G,MySQL无法获取足够多的内存所以挂掉了。

把坑填上

  • 这里,将缓存池大小改成1G后,重新启动MySQL,服务恢复正常。
1
2
3
4
5
6
7
8
9
10
11
12
root@iZubZ: ~ # vim /etc/my.cnf
......
[mysqld]
# innodb_buffer_pool_size=1536M
innodb_buffer_pool_size=1024
......
root@iZubZ: ~ # service mysql start
root@iZubZ: ~ # ps -ef|grep mysql
root 13078 1 0 Feb04 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/data/mysql --pid-file=/data/mysql/bocai.pid
mysql 13251 13078 0 Feb04 ? 00:02:42 /usr/sbin/mysqld --basedir=/usr --datadir=/data/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql
--log-error=/data/mysql/bocai.err --pid-file=/data/mysql/bocai.pid
root 13931 13896 0 11:59 pts/0 00:00:00 grep mysql
  • 因为这个不知何时进来的Tomcat配置的内存也较大,为了保证后期不出问题,我决定给ECS记上1G的虚拟缓存分区。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
root@iZubZ: ~ # free -m
total used free shared buffers cached
Mem: 1875 605 1270 0 14 41
-/+ buffers/cache: 1214 661
Swap: 0 0 0 # ECS默认没有虚拟缓存
root@iZubZ: ~ # cd /mnt # 切换到一个存放虚拟‘缓存文件’的目录
root@iZubZ: ~ # dd if=/dev/zero of=swap.disk bs=1M count=1k # 创建1G的虚拟内存文件
root@iZubZ: ~ # mkswap -f swap.disk # 创建Swap分区
root@iZubZ: ~ # swapon /mnt/swap.disk # 挂载Swap分区(卸载则需要swapoff /mnt/swap.disk 然后rm -rf /mnt/swap.disk)
root@iZubZ: ~ # vim /etc/fstab # 修改fstab文件,保证重启依然有效
......
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/mnt/swap.disk swap swap defaults 0 0 #新增行
......
root@iZubZ: ~ # sysctl vm.swappiness=50 # 临时性的修改虚拟内存使用率,重启会恢复为0
root@iZubZ: ~ # vim /etc/sysctl.conf # 修改文件,保证重启依然有效
......
#vm.swappiness = 0
vm.swappiness = 50
......

填坑后记

那个Tomcat还是没有人承认是自己装的,但是我猜也能猜到个差不多,不吐槽他了。这还好是在测试服务器,要是在正式服务器还不得疯了!但话说回来,要是正式服务器权限控制做成这样那公司迟早要完。
但是,即便是测试服务器,他装的时候不吱声还是不地道,毁了我多好一个白日梦!