Experimenting with a full load-sharing cluster

Status
Not open for further replies.

iota

New Member
May 29, 2020
USA
SRV
sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

I don't think you need this record. As far as I've read, SRV record names need to start with the service and protocol labels, such as _sip._tcp or _sip._udp. If you're trying to use it for HTTP load balancing, I'm afraid no browsers honor SRV records. Kind of sucks.

Thanks guys for mentioning that there can be multiple A records for the same domain address. This was very helpful.

Now it seems that just having the A records and the SRV records isn't enough... I'm just getting a round robin across the A records... (this is on one of the new Grandstream GRP2614 deskphones... even though I told it to specifically use SRV to resolve DNS).

I'm going to try a NAPTR record to see if this fixes it so that the SRV priority and weight parameters work.
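
For reference, here is roughly how a client that honors SRV is expected to use the priority and weight fields: lower priority values are tried first, and targets that share a priority are picked at random in proportion to their weight. This is a simplified Python sketch of that rule from RFC 2782, not what the GRP2614 firmware actually does; SrvRecord and order_srv_targets are names made up for illustration.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv_targets(records, rng=random):
    """Return the records in the order a client would try them."""
    ordered = []
    # Lower priority values are contacted first.
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # Within one priority, draw targets at random, weighted by "weight".
        while group:
            total = sum(r.weight for r in group)
            if total == 0:
                pick = rng.choice(group)
            else:
                point = rng.uniform(0, total)
                running = 0
                pick = group[-1]  # fallback for floating-point edge cases
                for r in group:
                    running += r.weight
                    if point <= running:
                        pick = r
                        break
            ordered.append(pick)
            group.remove(pick)
    return ordered


# Records from the thread: equal priority and weight.
records = [
    SrvRecord(10, 50, 5060, "server-0.aws.domain.com"),
    SrvRecord(10, 50, 5060, "server-1.aws.domain.com"),
]
print([r.target for r in order_srv_targets(records)])
```

With two records at 10/50, each server should come first about half the time, which on its own looks the same as round robin across the A records. The difference only becomes visible with unequal weights or a second priority level.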
 

gflow

Active Member
Aug 25, 2019
OK, thanks phonesimon. I tested this and can confirm that it works. I was not expecting it to work when both servers happened to bind to the same IP on the internal profile, but it does, with calls going to both servers.

Now the only thing I cannot figure out how to test is the client-side "failover". Testing on Bria Mobile, if I crash one of the servers during an active call, the app doesn't seem to move on to the next server in the SRV list. However, if I hang up the call and immediately place another, I hit the remaining server that is still up.

Are there any settings, DNS- or FreeSWITCH-wise, that encourage the client to try the other SRV servers and let them pick up the failed server's connection? Here are the DNS settings so far:

SRV
_sip._tcp.sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

SRV
_sip._udp.sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

SRV
sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

A
sip.aws.domain.com
1.2.3.4
2.3.4.5


Check out this link, it might help to keep the calls alive during failover: https://freeswitch.org/confluence/display/FREESWITCH/High+Availability#HighAvailability-TrackCalls
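
On the client-side failover question: for a new request, a client can simply walk the resolved target list and use the first server that answers, which matches the observation above that a fresh call lands on the surviving server. The Python sketch below illustrates only that step. It is not how Bria or FreeSWITCH implement it, and first_reachable and its parameters are hypothetical names. It says nothing about moving a call that is already in progress.

```python
import socket


def first_reachable(targets, timeout=2.0, connect=socket.create_connection):
    """Try each (host, port) in order; return the first that accepts a TCP connection."""
    for host, port in targets:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            continue  # this node looks down; move on to the next target
        conn.close()
        return host, port
    return None  # no target answered


if __name__ == "__main__":
    # Host names taken from the SRV records above.
    targets = [
        ("server-0.aws.domain.com", 5060),
        ("server-1.aws.domain.com", 5060),
    ]
    print(first_reachable(targets))
```

The connect parameter exists only so the function can be exercised without real servers.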
 

mydigitalself

Member
Oct 20, 2019
When you did this initial test, was it on a cloud provider such as AWS and did you need to bind to a non-local IP address?

I see in the logs that the xml_handler wants to send the call to the other server; however, it does not, and nothing shows up in sngrep or tcpdump as an attempt.
 

Mikey

New Member
Feb 10, 2020
AWS, and initially I was binding to the public IP. However, I ended up switching to a private IP on the cluster and putting OpenSIPS in front of it as the only internet-facing servers.
 

kt351b

Member
Feb 24, 2020
Test Results (so far)

Test setup:
  • a domain called "test.example.com" set up in FusionPBX
  • extensions 1000 and 1001
  • conference room 2000
  • DNS SRV for test.example.com pointing to node1 and node2 with equal weight/priority
Registered 1000 and 1001 and forced them to use node1 (1000) and node2 (1001) by specifying the proxy setting in the SIP client. So we are testing cross-cluster domain calls.

Works
  • extension-to-extension calls (both directions)
  • call hold/resume
  • blind transfer
  • attended transfer
  • conference
    • whoever starts the conference by dialing 2000 first hosts the conference on his node; when the other extension calls in, his call is routed over to that node to join the conference
  • call park and park retrieval
  • inbound calls from PSTN provider to either server (using DNS SRV pointing to external profile port 5080)
(More to come)
Thank you for your notes; they inspired me to try FusionPBX master-master in my new VoIP project. I ran into a strange problem; maybe you could explain what I am doing wrong.
1) I want FusionPBX with FreeSWITCH on server A, the same on server B, and a PostgreSQL database on server C.
So, servers A and B - FusionPBX, server C - database.
2) On server C I created the databases from the install script, added rules to pg_hba.conf, and so on.
3) I installed FusionPBX on servers A and B, changing the database credentials in the config.sh script before installation. The installation finished without errors, and I checked that I can run SQL queries against server C from servers A and B. Servers A and B use the same credentials (user fusionpbx and the same password).
4) I did everything as you wrote in comments #1-#6.
5) Now I register two extensions, 1000 and 1002, on the softphone on my PC and one extension, 1001, on the softphone on my mobile phone. Two of them are located on server A, the other one on server B.
1000, 1002 - registered at server A
1001 - registered at server B
I can call from 1000 to 1002.
But I can't reach extension 1001 registered on my mobile phone. I see the registration in the FusionPBX Registrations menu and in the freeswitch.registrations table, but I get this error in fs_cli:

[ERR] switch_core_sqldb.c:1369 SQL ERR: [SELECT hostname FROM registrations WHERE reg_user = '1004' AND realm = 'fortest' AND to_timestamp(expires) > NOW()] no such function: NOW

When I do this request from the server's CLI, I got:
CLI: psql -h 10.20.30.1 -U fusionpbx -d freeswitch -c"SELECT hostname FROM registrations WHERE reg_user = '1001' AND realm = 'fortest' AND to_timestamp(expires) > NOW();"
Password for user fusionpbx:
hostname
-----------
fusion149
(1 row)

So, I can get this record from the server using the credentials from the ${dsn}.
eval ${dsn} in fs_cli also shows me this variable.

But when I register all those extensions on my PC, everything works well! I tried registering from the mobile network (I thought NAT might be the cause) and tried different softphones (GS Wave, CSipSimple), with no luck.
I added the names and IP addresses of the servers to /etc/hosts, as they appear in the hostname column of freeswitch.registrations.
I thought I had made a mistake while creating the database, so I installed FusionPBX on another server, made an SQL dump, and restored it on the database server, but no luck: the same problem occurs.

Sometimes it works and I can make a call from the mobile softphone to my PC, but then the problem comes back...
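
A note on the error text itself: "no such function: NOW" is the message SQLite produces when a query calls NOW(), whereas PostgreSQL has NOW() and runs the same query fine, as the psql session above shows. One possible reading, which is an assumption and not something confirmed in this thread, is that the lookup was executed against an SQLite database instead of the PostgreSQL server. The short Python sketch below only demonstrates the SQLite side of that; try_now is a made-up helper name.

```python
import sqlite3


def try_now(conn):
    """Run SELECT NOW() and return the result rows, or the engine's error text."""
    try:
        return conn.execute("SELECT NOW()").fetchall()
    except sqlite3.OperationalError as exc:
        return str(exc)


# An in-memory SQLite database stands in for whatever engine ran the query.
print(try_now(sqlite3.connect(":memory:")))
```

If that reading is right, the thing to check would be which database the FreeSWITCH side actually opens, not the PostgreSQL credentials.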
 

agile

New Member
Oct 21, 2020
Hi all, I used the same method to set up my cluster with FusionPBX 4.5 on Debian 10.
I have a strange issue: an incoming call from the PSTN comes in, rings for a split second, and then goes to voicemail. I even rebuilt the Fusion server. The setup is a cluster, and the issue appears only when I fail over from node A to node B and then fail back to A again.
I have termination set up with two different SIP trunk providers, and when I call either number, it rings for a split second and goes to voicemail.

Before the call fails, I think this is the only error I see in fs_cli:
2021-03-17 22:52:24.561421 [ERR] switch_core_sqldb.c:1369 SQL ERR: [SELECT hostname FROM registrations WHERE reg_user = '105' AND realm = 'cc1.pbx.vcloudxi.com' AND to_timestamp(expires) > NOW()] no such function: NOW

full log on Pastebin : https://pastebin.com/vFNgqK1F




Any advice on how to fix this will be greatly appreciated.

sngrep call flow between 217.10.68.151:5060 and 172.31.44.123:5080 (--> is toward 172.31.44.123, the time in parentheses is the gap since the previous message):

20:48:48.506773  INVITE (SDP)                -->
20:48:48.508156  100 Trying                  <--  (+0.001383)
20:48:48.882956  183 Session Progress (SDP)  <--  (+0.374800)
20:48:49.001863  CANCEL                      -->  (+0.118907)
20:48:49.002002  200 OK                      <--  (+0.000139)
20:48:49.002058  487 Request Terminated      <--  (+0.000056)
20:48:49.120572  ACK                         -->  (+0.118514)

Selected message (CSeq: 2596 INVITE; the request line is missing from the paste):

Record-Route: <sip:217.10.68.151;lr>
Record-Route: <sip:172.20.40.8;lr>
Record-Route: <sip:217.10.68.137;lr>
Via: SIP/2.0/UDP 217.10.68.151;branch=z9hG4bK256b.219b0c0c69454fc86d32a24c7f2b8e68.0
Via: SIP/2.0/UDP 172.20.40.8;branch=z9hG4bK256b.2dc301159e9e2d6add1801c6ecbbc12c.1
Via: SIP/2.0/UDP 217.10.68.137;branch=z9hG4bK256b.eed111989f48ab036e95ba08edaa90f0.0
Via: SIP/2.0/UDP 217.10.77.79:5060;branch=z9hG4bKPje46ccf57-7c98-4a13-8d01-a46bc5af7753
From: "07877974711" <sip:07877974711@sipconnect.sipgate.co.uk>;tag=df9d7058-ecd9-4917-96cf-73b5af2a93ab
To: <sip:00442034684457@sipconnect.sipgate.co.uk>
Contact: <sip:07877974711@217.10.77.79:5060>
Call-ID: 159ffc2f-d30a-46a6-94af-d64fdf21b787
CSeq: 2596 INVITE
Allow: OPTIONS, SUBSCRIBE, NOTIFY, PUBLISH, INVITE, ACK, BYE, CANCEL, UPDATE, MESSAGE, REFER
Supported: replaces, norefersub
Max-Forwards: 67
Content-Type: application/sdp
Content-Length: 371

v=0
o=- 600715110 600715110 IN IP4 212.9.44.166
s=sGW
c=IN IP4 212.9.44.166
t=0 0
m=audio 20792 RTP/AVP 8 0 107 9 18 3 101
a=maxptime:20
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:107 opus/48000/2
a=rtpmap:9 G722/8000
a=rtpmap:18 G729/8000
a=rtpmap:3 GSM/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv
a=rtcp:20793
a=ptime:20
 

kt351b

Member
Feb 24, 2020
Hello. I have the same error with this setup.
The error is generated by this script (path on a Debian system):
/usr/share/freeswitch/scripts/app/xml_handler/resources/scripts/directory/directory.lua
I changed the NOW() function to ":ts" and added a "ts" variable to the script (ts = os.time(os.date("!*t"))):

Code:
--get the destination hostname from the registration
freeswitch.consoleLog("notice", " local_hostname " .. local_hostname .. "\n");
local params = {reg_user=reg_user, domain_name=domain_name, ts = os.time(os.date("!*t")) }
local sql = "SELECT hostname FROM registrations "
.. "WHERE reg_user = :reg_user "
.. "AND realm = :domain_name ";
if (database["type"] == "mysql") then
    params.now = os.time();
    sql = sql .. "AND expires > :now ";
else
    sql = sql .. "AND expires > :ts ";
end
if (debug["sql"]) then
    freeswitch.consoleLog("notice", "[xml_handler] SQL: " .. sql .. "; params:" .. json.encode(params) .. "\n");
end
dbh_switch:query(sql, params, function(row)
    database_hostname = row["hostname"];
end);
--freeswitch.consoleLog("notice", "[xml_handler] sql: " .. sql .. "\n");
--hostname was not found set USE_FS_PATH to false to prevent a database_hostname concatenation error
if (database_hostname == nil) then
    freeswitch.consoleLog("notice", "----------------- database_hostname nil ---------------- ");
    USE_FS_PATH = false;
end
--close the database connection
dbh_switch:release();

But the script logs this line:
----------------- database_hostname nil ----------------
That means the program entered the "if (database_hostname == nil) then" block.
But when I copy the SQL query and run it as the 'fusionpbx' user against the freeswitch database, I get a result with the "hostname" (in my case the result is ip-172-31-12-59). I don't understand why the Lua script thinks the result is "nil".

Output from the fs_cli:

switch_cpp.cpp:1447 [xml_handler] SQL: SELECT hostname FROM registrations WHERE reg_user = :reg_user AND realm = :domain_name AND expires > :ts ; params:{"reg_user":"306","domain_name":"fortest","ts":1616088593}
2021-03-18 17:29:53.212124 [NOTICE] switch_cpp.cpp:1447 ----------------- database_hostname nil ----------------

Then I run this query at the database using the user, host, password, database and port from the config.lua, 'database.switch' variable:
freeswitch=> SELECT hostname FROM registrations WHERE reg_user = '306' AND realm = 'fortest' AND expires > 1616088593;
hostname
-----------------
ip-172-31-12-59
(1 row)
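
A side note on the ts variable: in Lua, os.date("!*t") returns UTC calendar fields, but os.time() interprets a table of fields as local time, so os.time(os.date("!*t")) is off by the server's UTC offset unless the server runs on UTC. Plain os.time(), as used in the mysql branch, already returns seconds since the epoch. The Python sketch below is an analogue of that round trip, not the Lua code itself; the function names are made up, and whether the skew matters here depends on the servers' time zone, which the thread does not state.

```python
import calendar
import time


def utc_fields_read_as_local(epoch):
    """Python analogue of Lua's os.time(os.date("!*t"))."""
    # gmtime() gives UTC calendar fields; mktime() interprets fields as local time.
    return int(time.mktime(time.gmtime(epoch)))


def utc_fields_read_as_utc(epoch):
    """Round trip that stays in UTC, so it returns the original epoch."""
    return calendar.timegm(time.gmtime(epoch))


ts = 1616088593  # the ts value from the fs_cli output above
print("skew on this host (s):", utc_fields_read_as_local(ts) - ts)
print("UTC round trip intact:", utc_fields_read_as_utc(ts) == ts)
```

On a host set to UTC the skew is zero. On any other host it shifts the "expires > :ts" comparison by the UTC offset, in one direction or the other.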
 

kt351b

Member
Feb 24, 2020
Actually, I think I found the problem. I enabled the query log on the PostgreSQL server and made some calls. fs_cli prints the query:
2021-03-18 19:56:44.283438 [NOTICE] switch_cpp.cpp:1447 [xml_handler] SQL: SELECT hostname FROM registrations WHERE reg_user = :reg_user AND realm = :domain_name AND expires > :ts ; params:
{"reg_user":"101","ts":1616083004,"domain_name":"fortest"}
2021-03-18 19:56:44.283438 [NOTICE] switch_cpp.cpp:1447 ----------------- database_hostname nil ----------------

But I couldn't find this query in the query log:
root@DNS:~# grep 'reg_user' /var/lib/postgresql/11/main/pg_log/postgresql-2021-03-18_194312.log | grep "select"
root@DNS:~# grep 'reg_user' /var/lib/postgresql/11/main/pg_log/postgresql-2021-03-18_194312.log | grep "SELECT"
root@DNS:~# grep 'SELECT hostname' /var/lib/postgresql/11/main/pg_log/postgresql-2021-03-18_194312.log

And also I don't see any errors about the failed connection to DB.

In the script:
--connect to the switch database
dbh_switch = Database.new('switch');
OK, maybe the script can't connect to the DB using the credentials from /etc/fusionpbx/config.lua, but I am able to run this query against the database using the user, host, password, database, and port from the 'database.switch' variable in config.lua:

--database information
database = {}
database.type = "pgsql";
database.name = "fusionpbx";
database.path = [[]];
database.system = "pgsql://hostaddr=10.20.30.12 port=5432 dbname=fusionpbx user=fusionpbx password=STRONGPASS options=''";
database.switch = "pgsql://hostaddr=10.20.30.12 port=5432 dbname=freeswitch user=fusionpbx password=STRONGPASS options=''";
database.backend = {}
database.backend.base64 = 'luasql'
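
When comparing config.lua with a manual psql session, it can help to pull the individual settings out of the connection string so they can be checked one by one. A small Python sketch, assuming the string keeps the space-separated key=value layout shown above; parse_dsn is a hypothetical helper, not part of FusionPBX or FreeSWITCH.

```python
import shlex


def parse_dsn(dsn):
    """Split a 'pgsql://key=value key=value ...' string into a dict."""
    scheme, _, rest = dsn.partition("://")
    settings = {"scheme": scheme}
    # shlex handles the quoted empty value in options=''.
    for token in shlex.split(rest):
        key, _, value = token.partition("=")
        settings[key] = value
    return settings


# The database.switch value from config.lua above.
switch_dsn = ("pgsql://hostaddr=10.20.30.12 port=5432 dbname=freeswitch "
              "user=fusionpbx password=STRONGPASS options=''")
info = parse_dsn(switch_dsn)
print(info["hostaddr"], info["port"], info["dbname"], info["user"])
```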
 

agile

New Member
Oct 21, 2020
Hi,
Thanks for the message. Yes, I have the same details in config.lua.
So did the change you made from NOW() to ts fix the issue with CDR and calls dropping from the PSTN network?

Please advise.
 

kt351b

Member
Feb 24, 2020
I don't have issues with CDR or with calls dropping from the PSTN network. Everything works. I just wonder what causes such an error; maybe FreeSWITCH wasn't configured correctly to work with the database.
 

cargreg

New Member
Apr 26, 2018
Hi guys, I followed the guide and encountered an issue: the extensions can't call each other. I have extension 100 registered on fs1 and extension 101 registered on fs2. The call terminates with the error 'entering state [terminated][503]' and then voicemail starts.
With the log at debug level 9, I saw this message:

'tport.c:2707 tport_connected() tport_connected(0x7f912c01a300): events CONNECTED ERR
tport.c:4251 tport_release() tport_release(0x7f912c01a300): 0x7f912c017360 by 0x7f912c019780 with (nil)
nta.c:8590 outgoing_print_tport_error() nta: INVITE (49730213): Connection refused (111) with tcp/[my_public_ip]:11905
nua_stack.c:301 nua_stack_event() nua(0x7f90c0006010): event r_invite 503 Service Unavailable'
I think this is normal, especially in a NAT environment.
How can I allow calls between extensions registered on different PBXs?
 

iota

New Member
May 29, 2020
USA
This is not the kind of cluster you might be thinking of. It is more like three nodes with shared configuration. Phones that register to fs1 will not be registered to fs2 (there is a way to do this, but it is fragile and I never got it to work). It is far better to make each phone "dual register" to two nodes. In other words, phoneA and phoneB both need to register to fs1 and fs2, so that phoneA can send the call to either fs1 or fs2 and always have a route to phoneB.

Most devices and softphones have the ability to register to two (or more) nodes at the same time. It's poorly documented, but through trial and error you should be able to figure it out.
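
To make the reasoning concrete, here is a toy Python model of the description above: a call can stay on one node only if both phones are registered on that node. It deliberately ignores the cross-node hostname lookup discussed earlier in the thread, and nodes_that_can_connect is a made-up name.

```python
def nodes_that_can_connect(registrations, caller, callee):
    """Nodes on which both phones are registered, so a call can stay on one node."""
    return sorted(
        node for node, phones in registrations.items()
        if caller in phones and callee in phones
    )


# Each phone registered to a single node (the failing case from the question).
single = {"fs1": {"100"}, "fs2": {"101"}}
# Each phone registered to both nodes (dual registration).
dual = {"fs1": {"100", "101"}, "fs2": {"100", "101"}}

print(nodes_that_can_connect(single, "100", "101"))
print(nodes_that_can_connect(dual, "100", "101"))
```

With single registration no node knows both phones; with dual registration either node does, so losing one node still leaves a route.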

For example... Here are some notes I have on Dual Registration with Yealink T46s phones:
Make sure Firmware Version is at or above 66.84.0.15
Dual registration bugs were fixed on this version.

Label and Display Name: __(whatever you want to appear on callee’s phone)__
Register Name: 100
User Name: 100@my.domain.com ("my.domain.com" would be whatever the domain name is in your fusionpbx, as opposed to your server's DNS address)
Password: __(as given in extension page)__
SIP Server 1
Server Host: fs1.domain.com
Transport: TCP
SIP Server 2
Server Host: fs2.domain.com
Transport: TCP
Enable Outbound Proxy Server: Disabled
NAT: Disabled

Note: if NAT traversal is set to STUN, then calls originated by this phone are immediately hung up on and will continue to ring on any and all destination endpoints (softphones, deskphones, and even external calls to landlines and cellphones). This is because both phone and Freeswitch are trying to circumvent NAT issues, and they step on each other’s toes. Just let Freeswitch handle it.

Note: if you want the “Server Host” or “Outbound Proxy Server” fields to query SRV records, you must put 0 in the port field next to it. But I've found this to be just as unreliable and slow as the phone's built-in Failover options. Don't waste your time with those, and leave them at their defaults.
Most phones are very picky about dual registration. If you were to enable the phone's "Failover" options, it would totally break Dual Registration. It's quite ironic, since dual registration is far superior to slow and problematic failover logic.

But my favorite has been Grandstream GRP2614 phones with their free cloud management portal:
MAKE SURE PHONE IS ON HOTFIX FIRMWARE 0.9.2.180 or higher (you should no longer need to worry about this)

Best settings if done via GDMS cloud portal (bare minimum settings):
VoIP Account / SIP Server / fs1.domain.com
  • Server Name: my.domain.com (would be whatever the domain name is in your fusionpbx, as opposed to your server's DNS address)
  • SIP Server: fs1.domain.com (dns address for main server)
  • Outbound Proxy: __empty__
  • Backup Outbound Proxy: __empty__
  • Voice Mail Access Number: *97
  • DNS Mode: A Record
  • NAT Traversal: NAT NO
  • Additional Settings: Failover SIP Server: fs2.domain.com
    (doesn’t work yet, but do it anyway for when Grandstream support fixes this. until then, a special text edit has to be made to the template, as below)
VoIP Account / SIP Account / 100 /
  • Account Name: 100 (this shows on phone LCD screen)
  • SIP Server: fs1.domain.com
  • SIP User ID 100@my.domain.com
  • Authenticate ID 100
  • Authenticate Password: __as listed in FusionPBX / Extension page__
  • Name: Bill Jones (shows on callee screen for internal calls)
Template / By Model / __template name__
  • Account 1 / SIP Settings /
    • SIP Transport: TCP
    • Templates / By Group / __applicable template__ / Switch to Text Editor
      in the right numerical order, insert this line:
      2312=fs2.domain.com
      Note: This text edit is a temporary handling until they fix the “SIP Server / Failover SIP Server” setting. This was given to me by Grandstream Support. Remember to remove this line when they fix the “Failover SIP Server” setting.

It's well worth the effort. I've had great luck with Dual Registration and Multi-Node FusionPBX. I have yet to get frantic calls about the phone server being down, two years uptime so far (despite having nodes go down for cloud provider maintenance and outages). I also highly recommend FusionPBX Support. It seems pricey until they save your rear several times.

Don't use AWS... their outbound traffic fees are outrageous. DigitalOcean, Vultr and others are your friend. Combined with a mix of SIP providers like Twilio, Skyetel, Telnyx and Flowroute (using the distributor module to spread outbound traffic amongst providers). They are all pay-as-you-go providers with decent prices. Flowroute has horrible support and inbound calling issues, but their outbound call traffic works great (the more failover, the better).
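
As a rough picture of what spreading outbound traffic amongst providers means, the Python sketch below rotates through gateway names in proportion to integer weights. It is only an illustration of the idea, not the distributor module's actual algorithm or configuration; weighted_rotation and the weights are hypothetical.

```python
import itertools


def weighted_rotation(weights):
    """Cycle through gateway names, each repeated according to its integer weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


# Hypothetical weights; the thread does not give the real ones.
rotation = weighted_rotation({"twilio": 2, "telnyx": 1, "skyetel": 1})
print([next(rotation) for _ in range(8)])
```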

Also, sngrep is your best friend. Use it to debug strange NAT issues.

Good luck!
 

cargreg

New Member
Apr 26, 2018
@iota I have abandoned this approach. I have now created a test environment with two VPSs running FusionPBX, one with PostgreSQL and some NFS shares, and another with OpenSIPS. I set up a per-domain load balancer, so that all the extensions of a domain are registered on the same FusionPBX.
 