Hello all,
I have been having an issue when I try to run "fsctl recover" on the failover server. Here is my config:
I have two identical servers from Linode. I used the tutorial "the easy way" here. For the IP failover side of things I am using keepalived, and from an IP/pinging standpoint that part seems to be working correctly. The logs show starting MASTER and starting BACKUP, and everything works as expected (1, maybe 2, packets lost from a running ping during cutover).
So:
1. I have 2 extensions registered (101 and 102). I place a call from one extension to the other; the MASTER FS sets up the call and all is well. I check the FS db on both nodes while this call is up and I see data in the recovery and calls tables. Seems good so far...
2. I take down the eth0 port on the MASTER and the call stops passing media (as I expect).
3. I see the IP get moved to the BACKUP.
4. In the fs_cli on the BACKUP, I manually run (for now) "fsctl recover".
5. I see the following in the BACKUP fs_cli:
227a5117-4039-41ee-98ef-350c7a15891c 2017-11-24 12:11:58.573620 [DEBUG] switch_core_state_machine.c:646 (sofia/internal/101@192.168.23.143:5060) State RESET going to sleep
2017-11-24 12:11:58.753625 [DEBUG] switch_pgsql.c:415 Query (insert into channels (uuid,direction,created,created_epoch, name,state,callstate,dialplan,context,hostname,initial_cid_name,initial_cid_num,initial_ip_addr,initial_dest,initial_dialplan,initial_context) values('64800f2b-eee1-4a90-9cb0-f55c2f7341d7','inbound','2017-11-24 12:11:58','1511543518','sofia/internal/102@cluster.testpbx.com','CS_INIT','ACTIVE','XML','cluster.testpbx.com','vg-cluster-1','My MAC','102','96.239.out.ip','101','XML','cluster.testpbx.com')) returned PGRES_FATAL_ERROR
2017-11-24 12:11:58.753625 [DEBUG] switch_pgsql.c:415 Query (insert into channels (uuid,direction,created,created_epoch, name,state,callstate,dialplan,context,hostname,initial_cid_name,initial_cid_num,initial_ip_addr,initial_dest,initial_dialplan,initial_context) values('227a5117-4039-41ee-98ef-350c7a15891c','outbound','2017-11-24 12:11:58','1511543518','sofia/internal/101@192.168.23.143:5060','CS_INIT','DOWN','XML','cluster.testpbx.com','vg-cluster-1','My MAC','102','96.239.out.ip','101','XML','cluster.testpbx.com')) returned PGRES_FATAL_ERROR
2017-11-24 12:11:58.753625 [ERR] switch_pgsql.c:656 Error executing query:
ERROR: current transaction is aborted, commands ignored until end of transaction block
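For anyone comparing logs: PostgreSQL reports "current transaction is aborted" for every statement that follows an earlier failure inside the same transaction, so that message is a follow-on error and the first failed statement is the one worth finding. Below is a minimal Python sketch for pulling the failed inserts out of a saved fs_cli log, in order. The function name failed_inserts and the shortened sample lines are made up for illustration; only the line format is taken from the output above. The fs_cli output only shows PGRES_FATAL_ERROR, so the underlying reason still has to be read from the PostgreSQL server log.

```python
import re

# Matches switch_pgsql debug lines shaped like the ones in this post:
#   ... Query (insert into <table> (...) values('<uuid>', ...)) returned PGRES_FATAL_ERROR
FAILED_INSERT = re.compile(
    r"Query \(insert into (?P<table>\w+) .*?values\('(?P<uuid>[0-9a-f-]{36})'.*"
    r"returned PGRES_FATAL_ERROR"
)


def failed_inserts(lines):
    """Return (table, uuid) for each failed insert, in log order."""
    found = []
    for line in lines:
        match = FAILED_INSERT.search(line)
        if match:
            found.append((match.group("table"), match.group("uuid")))
    return found


if __name__ == "__main__":
    # Sample lines are shortened (column lists trimmed) for readability.
    sample = [
        "2017-11-24 12:11:58.753625 [DEBUG] switch_pgsql.c:415 Query (insert into channels "
        "(uuid,direction) values('64800f2b-eee1-4a90-9cb0-f55c2f7341d7','inbound')) "
        "returned PGRES_FATAL_ERROR",
        "2017-11-24 12:11:58.753625 [ERR] switch_pgsql.c:656 Error executing query:",
    ]
    for table, uuid in failed_inserts(sample):
        print(table, uuid)
```

The first tuple returned is the earliest failed insert that fs_cli logged; an even earlier failure in the same transaction may only be visible on the PostgreSQL side.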
It appears the BACKUP box is trying to insert the call, which I assume FS got by doing a lookup in the replicated db at the beginning of "fsctl recover", although the fs_cli output doesn't show anything about that. By the way, I see basically the same fatal error in the postgres logs. The phones do not hang up on the BACKUP, but there is no media. I also noticed these entries in the log right after the above error:
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] sofia.c:7084 Channel sofia/internal/102@cluster.testpbx.com entering state [terminated][503]
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [NOTICE] sofia.c:8273 Hangup sofia/internal/102@cluster.testpbx.com [CS_SOFT_EXECUTE] [NORMAL_TEMPORARY_FAILURE]
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:712 sofia/internal/102@cluster.testpbx.com ending bridge by request from read function
64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:787 BRIDGE THREAD DONE [sofia/internal/102@cluster.testpbx.com]
227a5117-4039-41ee-98ef-350c7a15891c 2017-11-24 12:12:13.573665 [DEBUG] switch_ivr_bridge.c:787 BRIDGE THREAD DONE [sofia/internal/101@192.168.23.143:5060]
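Going by the timestamps, the 503 and hangup lines come about 15 seconds after the "State RESET going to sleep" line. Whether that gap corresponds to any particular timer is not something these logs show. A small Python sketch for measuring the gap between any two log lines follows; the helper names log_time and seconds_between are invented for illustration.

```python
import re
from datetime import datetime

# FreeSWITCH log timestamps look like "2017-11-24 12:11:58.573620".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}")


def log_time(line):
    """Parse the first timestamp found in a log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line")
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")


def seconds_between(first_line, second_line):
    """Seconds elapsed from first_line to second_line."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


if __name__ == "__main__":
    recover = ("227a5117-4039-41ee-98ef-350c7a15891c 2017-11-24 12:11:58.573620 "
               "[DEBUG] switch_core_state_machine.c:646 State RESET going to sleep")
    hangup = ("64800f2b-eee1-4a90-9cb0-f55c2f7341d7 2017-11-24 12:12:13.573665 "
              "[NOTICE] sofia.c:8273 Hangup [NORMAL_TEMPORARY_FAILURE]")
    print(seconds_between(recover, hangup))
```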
I have seen a video on YouTube that shows this recovery in action, and I see the PGRES_FATAL_ERROR in those logs too, but the call is created anyway. Maybe there is a setting I am missing?
Any help would be appreciated!
Thanks