Experimenting with a full load-sharing cluster

Status
Not open for further replies.

phonesimon

Member
Apr 21, 2017
87
16
8
44
Overview

I wanted to share my general notes on what I have working so far, and how, to achieve a full load-sharing FusionPBX cluster. I am hoping that we can collaborate in this thread to work up a document showing what's possible and what's not.

My goal is to have a PBX cluster with separated data layer and service layer so that any number of FreeSWITCH nodes (service layer) can be joined in and the configs and live data will all exist in the database, which does not necessarily have to coexist with FreeSWITCH. That said, there's value in having a local DB on every FS machine to reduce latency between the DB and the things that are accessing it (FS, lua scripts, FusionPBX web interface). So what I had in mind is just to do a full install of FusionPBX on every node and link the databases.

I'm not too comfortable with BDR yet, so at this point in my experiments I have two nodes, both pointing to a single node for their shared fusionpbx and freeswitch databases. I believe the effective result is the same as if we were running BDR and each FusionPBX instance accessed its local clustered DB instance.

The FS nodes should be able to service any endpoints or incoming calls routed from a provider. By way of the shared freeswitch database they should be able to route internal calls between nodes in order to do extension-to-extension calling, call park and retrieval, conferencing, queues, etc.

Of course, this should work with the multi-tenant domain model.

I'll add posts to the thread to write about some of the specifics.

NOTE! There are likely errors in my ideas and methods so please point them out. Ultimately I would like a refined and accurate how-to guide to come out of this.
 

phonesimon

Lab Setup

Two nodes on DigitalOcean, in two different datacenters in the same region. I launched two Debian 8 instances with 1 GB RAM each.

On both nodes I did standard installs of FusionPBX using the installer at https://github.com/fusionpbx/fusionpbx-install.sh . Before running debian/install.sh, I edited debian/resources/config.sh and made the following changes:
  • system_branch=master
  • database_repo=2ndquadrant
(I chose the 2ndquadrant repo so that I could change this over to BDR later.)

Run the installs.

I decided "node1" would be my DB host so I basically ignored the DB installation on node2 at this point.

Configure PostgreSQL on node1 to listen on all interfaces (listen_addresses = '*' in postgresql.conf) and grant access to the other node in pg_hba.conf. Also add a pg_hba.conf line for node1's own IP, since node1 will be connecting to its own public address as well. Adjust iptables to allow access to the PostgreSQL port (5432) on node1 from node2.
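A quick way to confirm the firewall and listen settings before touching the FusionPBX configs is to test whether the PostgreSQL port on node1 can be reached from node2. This is only a minimal sketch: it checks that a TCP connection opens, not that pg_hba.conf accepts the login. The function name and the example address are placeholders of mine, not part of the original setup.

```python
import socket


def port_reachable(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only proves that the port is open and not firewalled; it does not
    prove that pg_hba.conf will accept the database login.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


# Example use from node2 (203.0.113.10 is a documentation placeholder,
# substitute node1's public address):
#     print(port_reachable("203.0.113.10"))
```

If this returns False from node2, check iptables and listen_addresses first; if it returns True but logins fail, the problem is more likely in pg_hba.conf.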

DB

Edit /etc/fusionpbx/config.lua on both nodes. Set the database.system and database.switch DSNs to point to the public address of node1, using the database password from node1's copy of the file.
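For reference, the two DSNs differ only in the database name. The sketch below builds both strings so they stay consistent across nodes. The exact key layout (hostaddr, port, dbname, user, password, options) is my assumption based on what a stock config.lua typically contains, so compare it with the lines already in your file before pasting anything. The address, user and password shown are placeholders.

```python
def build_dsn(host_ip: str, dbname: str, user: str, password: str,
              port: int = 5432) -> str:
    """Build a pgsql:// DSN string in the assumed config.lua layout.

    hostaddr expects an IP address rather than a hostname.
    """
    return (
        f"pgsql://hostaddr={host_ip} port={port} dbname={dbname} "
        f"user={user} password={password} options=''"
    )


# Placeholder values; use node1's public IP and the password from node1.
NODE1_IP = "203.0.113.10"
system_dsn = build_dsn(NODE1_IP, "fusionpbx", "fusionpbx", "CHANGE_ME")
switch_dsn = build_dsn(NODE1_IP, "freeswitch", "fusionpbx", "CHANGE_ME")

print(f'database.system = "{system_dsn}";')
print(f'database.switch = "{switch_dsn}";')
```

The same switch DSN is what later goes into the dsn variable, so generating it once avoids typos between the two places.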

While in this file, find the xml_handler.fs_path line and set it to true, as this will be needed for cluster routing.

Edit /etc/fusionpbx/config.php in the same manner as above. Point both nodes to the database on node1.

Now from the web interface of node1 you can log in using admin/the password given after setup. From the web interface of node2 you can log in using admin@node1ipaddress and the password for node1.
 

phonesimon

Telling FS to use PGSQL as its core DB

Edit /etc/freeswitch/autoload_configs/switch.conf.xml on both nodes.

Uncomment the "core-db-dsn" param which has the value $${dsn}.

Uncomment the "core-dbtype" param and set the value to "pgsql".
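Since it is easy to leave one of the two params commented out on one node, here is a small sketch that parses the file contents and reports which of them are still inactive. It relies on the fact that Python's XML parser skips comments, so a commented-out param simply does not show up. The SAMPLE text is a cut-down illustration, not the real file, and the function names are mine.

```python
import xml.etree.ElementTree as ET

# Cut-down illustration of switch.conf.xml: the DSN param is still commented.
SAMPLE = """<configuration name="switch.conf" description="Core Configuration">
  <settings>
    <!-- <param name="core-db-dsn" value="$${dsn}"/> -->
    <param name="core-dbtype" value="pgsql"/>
  </settings>
</configuration>"""


def active_params(xml_text: str) -> dict:
    """Return {name: value} for every param that is not commented out."""
    root = ET.fromstring(xml_text)  # comments are dropped by the parser
    return {p.get("name"): p.get("value") for p in root.iter("param")}


def missing_core_db_params(xml_text: str) -> list:
    """List the core DB params that are absent or not set as described."""
    params = active_params(xml_text)
    missing = []
    if "core-db-dsn" not in params:
        missing.append("core-db-dsn")
    if params.get("core-dbtype") != "pgsql":
        missing.append("core-dbtype")
    return missing


print(missing_core_db_params(SAMPLE))
```

To check a real node, read /etc/freeswitch/autoload_configs/switch.conf.xml into a string and pass it to missing_core_db_params; an empty list means both params are active.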

In FusionPBX, go to Variables and add a new variable:
  • name: dsn
  • value: the same DSN string used in the config.lua file with dbname=freeswitch
  • enabled: true
Go to the SIP Profiles section and edit each SIP profile. Enable the "odbc-dsn" parameter, which already points to $${dsn}. (Note: I don't know exactly why ODBC settings must be specified on the SIP profiles when PostgreSQL is already enabled for the core.)

At this point, to get all the configs reloaded, I restarted memcached and freeswitch on both nodes. Using psql, you should be able to check the freeswitch database and see that the schema has been populated; if you select from the interfaces table, you should see entries for both nodes.
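The check against the interfaces table boils down to one query: list the distinct hostnames and confirm every node appears. The sketch below uses an in-memory SQLite database purely as a stand-in so it runs anywhere; against the real cluster you would pass in a PostgreSQL connection from whichever driver you use. The hostname column is my assumption about the FreeSWITCH core schema, and the rows are made up.

```python
import sqlite3


def hostnames_in_interfaces(conn) -> list:
    """Return the distinct hostnames that have rows in the interfaces table.

    Works with any DB-API connection that exposes cursor().
    """
    cur = conn.cursor()
    cur.execute("SELECT DISTINCT hostname FROM interfaces ORDER BY hostname")
    return [row[0] for row in cur.fetchall()]


def demo_connection():
    """Build an in-memory stand-in for the freeswitch database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE interfaces (type TEXT, name TEXT, hostname TEXT)")
    conn.executemany(
        "INSERT INTO interfaces VALUES (?, ?, ?)",
        [
            ("codec", "PCMU", "node1"),
            ("codec", "PCMU", "node2"),
            ("endpoint", "sofia", "node1"),
            ("endpoint", "sofia", "node2"),
        ],
    )
    return conn


print(hostnames_in_interfaces(demo_connection()))
```

If only one hostname comes back on the real database, the other node is probably still using its local core DB.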
 

phonesimon

Configure FusionPBX for DB storage

My goal is to store as much as possible in the database so that we don't have to worry about data on the filesystem and can add and remove service nodes somewhat freely. There are likely performance considerations in doing this, but I don't know yet. Doing filesystem replication with corosync among nodes is the other option and has been written about elsewhere on the forum.

FusionPBX -> Advanced -> Default Settings
  • Recordings storage_type = base64, Enabled = true (store recordings in the database)
  • Voicemail storage_type = base64, Enabled = true (store voicemail messages in the database rather than in the filesystem)
  • Fax storage_type = base64, Enabled = true

lua-sql is needed for this. Install it on Debian like this:
  • apt-get install libpq-dev lua-sql-postgres-dev
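On the performance question, one cost of base64 storage is easy to quantify: the encoded text is about a third larger than the original audio. The sketch below shows the round trip and the size calculation. It only illustrates the encoding itself; how FusionPBX lays out the columns is not covered, and the function names are mine.

```python
import base64


def encode_for_db(audio_bytes: bytes) -> str:
    """Encode raw audio bytes as base64 text suitable for a text column."""
    return base64.b64encode(audio_bytes).decode("ascii")


def decode_from_db(text: str) -> bytes:
    """Reverse encode_for_db and return the original bytes."""
    return base64.b64decode(text)


def encoded_size(raw_size: int) -> int:
    """Length of the base64 text for raw_size bytes (4 chars per 3 bytes)."""
    return 4 * ((raw_size + 2) // 3)


sample = bytes(range(256)) * 4  # stand-in for a short recording
stored = encode_for_db(sample)
assert decode_from_db(stored) == sample
print(len(sample), len(stored), encoded_size(len(sample)))
```

So a 3 MB recording becomes roughly 4 MB of text in the database, which matters for replication traffic once the databases are clustered.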
 

phonesimon

Permitting intra-cluster calling

Some intra-cluster calling will work without making any changes; for example, extension-to-extension calling when the extensions are registered on different servers. If you watch the FreeSWITCH consoles you will see that the call is initiated (for example) on node1, finds the registered target on node2 and sends the call there. Node2 then sees the call and because it is coming in on the internal profile (sip 5060) it issues an auth challenge, which the caller is able to answer because he is a member of the domain.

I worked out something different to eliminate the auth challenge and handle situations where this wouldn't work.

In Advanced - Access Controls I edited the Domains ACL and used it to hold all of the cluster nodes. NOTE: if you do this you can't also use this ACL for your providers, so make another one for that purpose or send your providers to the external profile, as you should. My Domains ACL contains the IP addresses of the two nodes.

Then I edited SIP Profiles - Internal: apply-inbound-acl = domains:cluster-in (rather than just domains). This syntax (according to https://freeswitch.org/confluence/display/FREESWITCH/ACL ) says that any calls that match this ACL should go to the context "cluster-in" rather than the default context for this profile which is "public."
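To make the "acl:context" behaviour concrete, here is a sketch of the decision as described in this post: the setting is split into an ACL name and a target context, and a source address that matches the ACL is sent to that context. This models my reading of the syntax, not FreeSWITCH's actual implementation, and the node addresses are placeholders.

```python
import ipaddress


def parse_acl_setting(value: str, default_context: str = "public"):
    """Split 'acl' or 'acl:context' into (acl_name, context)."""
    name, sep, context = value.partition(":")
    return name, (context if sep else default_context)


def context_for_source(source_ip: str, cluster_nodes, acl_setting: str):
    """Return the context for a matching source IP, or None if no match.

    None stands for 'not in the ACL', where the profile would fall back to
    its normal handling (for example the auth challenge).
    """
    _, context = parse_acl_setting(acl_setting)
    addr = ipaddress.ip_address(source_ip)
    for entry in cluster_nodes:
        # strict=False lets a bare host address act as a single-host network.
        if addr in ipaddress.ip_network(entry, strict=False):
            return context
    return None


# Placeholder addresses for node1 and node2.
NODES = ["198.51.100.11", "198.51.100.12"]
print(context_for_source("198.51.100.12", NODES, "domains:cluster-in"))
print(context_for_source("192.0.2.50", NODES, "domains:cluster-in"))
```

The second call returns None, which is the case where an ordinary endpoint or stranger hits the internal profile and still has to authenticate.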

Now create the cluster-in context:

Mine looks like this in Dialplan Manager. It works out where the call is destined based on the domain of the call and transfers it into that part of the dialplan; pretty simple.

[Screenshot of the cluster-in context in Dialplan Manager: 1520825406402.png]
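Since the screenshot does not survive in text form, here is a rough sketch of the logic as the post describes it: take the domain the call is addressed to and transfer the call into the dialplan context for that domain. This is a reconstruction from the description, not the actual dialplan in the image. It assumes each tenant's context is named after its domain, and the function name is mine.

```python
def cluster_in_transfer(destination_number: str, request_domain: str,
                        known_domains) -> str:
    """Build the argument string for a transfer into the domain's context.

    Returns an empty string when the domain is not hosted on this cluster,
    in which case the call should not be transferred at all.
    """
    if request_domain not in known_domains:
        return ""
    # Same shape as the arguments of the dialplan transfer application:
    # <extension> <dialplan type> <context>
    return f"{destination_number} XML {request_domain}"


DOMAINS = {"test.example.com"}
print(cluster_in_transfer("1001", "test.example.com", DOMAINS))
print(repr(cluster_in_transfer("1001", "unknown.example.net", DOMAINS)))
```

Refusing unknown domains is my own addition; it keeps a trusted node from being used to reach contexts that were never meant to be reachable from cluster-in.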

I'm still testing it out but this little piece of dialplan seems to handle what I have thrown at it so far.
 

phonesimon

Test Results (so far)

Test setup:
  • a domain called "test.example.com" set up in FusionPBX
  • extensions 1000 and 1001
  • conference room 2000
  • DNS SRV for test.example.com pointing to node1 and node2 with equal weight/priority
Registered 1000 and 1001 and forced them to use node1 (1000) and node2 (1001) by specifying the proxy setting in the SIP client. So we are testing cross-cluster domain calls.
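For anyone reproducing the DNS side, the sketch below prints SRV records in zone-file form for the test domain, with both nodes at equal priority and weight. The target hostnames, TTL and the priority/weight numbers are placeholders of mine; only the domain and the equal weighting come from the test setup above.

```python
def srv_records(domain: str, targets, service: str = "_sip",
                proto: str = "_udp", priority: int = 10, weight: int = 50,
                port: int = 5060, ttl: int = 300) -> list:
    """Return zone-file SRV lines, one per target, with equal priority/weight."""
    owner = f"{service}.{proto}.{domain}."
    return [
        f"{owner} {ttl} IN SRV {priority} {weight} {port} {target}."
        for target in targets
    ]


# Placeholder hostnames for the two nodes.
for line in srv_records("test.example.com",
                        ["node1.example.com", "node2.example.com"]):
    print(line)
```

Equal priority with equal weight is what spreads new registrations and calls roughly evenly; a lower priority number on one node would turn the other into a pure standby.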

Works
  • extension-to-extension calls (both directions)
  • call hold/resume
  • blind transfer
  • attended transfer
  • conference
    • whoever starts the conference by dialing 2000 first hosts the conference on his node; when the other extension calls in, his call is routed over to that node to join the conference
  • call park and park retrieval
  • inbound calls from PSTN provider to either server (using DNS SRV pointing to external profile port 5080)
(More to come)
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
3,070
577
113
Looking good. At this point I'm jumping on and spinning up a couple of boxes to replicate this scenario, as I have never seen this kind of thing work :)
 

smn

Member
Jul 18, 2017
201
20
18
Great work. I would be interested to see what would happen if you have the two DBs clustered and one node fails or stops responding.

Do both nodes have the exact same hostname? Since you have two IPs, you must be relying on NAPTR/SRV for the load balancing, correct?
 

phonesimon

Yes, use SIP SRV records for the load sharing/balancing on each SIP domain; the individual hostnames are different. I think the DB is the most sensitive part. It would be nice if there were a failover/LB option for DB connections. With MySQL I have used mysqlproxy and mysql-router in the past. I don't know what the PostgreSQL equivalent is.
 

smn


Thanks for the info. I think I will try testing this out myself.
 

Maani

Member
Nov 12, 2017
34
1
8
53
Hi, thank you for sharing.
Did you use the same switchname on both nodes and different hostnames?
 

mutt

New Member
May 10, 2018
29
0
1
42
This is pretty cool. I've got WebRTC clients connecting to a pool of servers behind a load balancer. The connections are established randomly to the servers in the pool, and registration is shared in the common pgsql database. Inter-cluster calling works with fs_path, but for my inbound calls I had to create a dedicated context to regex the number (depending on where it originated) before transferring to the context.

Works like a charm. Conferencing doesn't work, so I just created a dedicated conference server. The rest seems to work.
 

Hein Tonny Køien

New Member
Oct 21, 2018
1
0
1
47
Great work, I will give this a try within Microsoft Azure. Azure has a managed PostgreSQL service cluster; it offers an SLA of 99.99%, starting at under 30 USD/month. It's only available with versions 9.5 to 10. Do you have any experience with FreeSWITCH and PostgreSQL version 9.5?
 

SlimJim

New Member
Feb 6, 2018
11
1
3
34
Northern Indiana
Hein Tonny Køien, did you ever get this to work on Azure? My ODBC connection is having issues with the "@" in the username.
 

dannyztar

New Member
Jul 24, 2018
7
0
1
34
phonesimon, I configured a similar setup using AWS. I used EFS for the shared storage and RDS for my DBs. It works pretty well, though I still have to do some more testing. I have one issue right now: when the cluster dialplan transfers the call to the extension on another switch, I lose the caller ID info. Instead, it seems to be exporting the destination number as the caller ID. Does anyone know how I can insert the original caller ID into the bridged call destined for switch B?

This is what I see in the logs:

Action export(origination_callee_id_name=${destination_number})

[Screenshot: 2019-03-21 11_53_25-Dialplan - FusionPBX.png]
 

Mikey

New Member
Feb 10, 2020
15
1
3
54
I'm trying to follow this setup. I have two servers with different hostnames (server1.domain.com and server2.domain.com) and one SRV/A domain (sip.domain.com). Both servers share the same FreeSWITCH and FusionPBX DB. A few questions:

1) What are you using for external_sip_ip and ext-sip-ip? Should it be "host:sip.domain.com" or something else?

2) Do I need two internal profiles so each server binds to its own IP? On the SIP Status page I find that both servers at times bind to the same IP for the same internal profile.
Thanks!
 

phonesimon

While I had good success with the setup I documented in 2018, I have since moved on to other projects and have not spent any time further refining this. I plan to do so and will post again when I have had some more time to work on it.
 

Mikey

Ok, thanks phonesimon. I'm just stuck on these two parts, if you or anyone else knows the answer:

1) What are you using for external_sip_ip and ext-sip-ip? Should it be the main SRV domain "host:sip.domain.com" or something else?

2) Do I need two internal profiles so each server binds to its own IP/hostname? On the SIP Status page I find that both servers at times bind to the same IP for the same internal profile.
Thanks!
 

Mikey

Ok, thanks phonesimon. I tested this and can confirm that it works. I was not expecting it to work when both servers happened to bind to the same IP on the internal profile; however, it works with calls going to both servers.

Now the only thing I cannot figure out how to test is client-side "failover". Testing on Bria Mobile, if I crash one of the servers during an active call, the app doesn't seem to move to the next entry in the SRV list. However, if I hang up the call and immediately place another, I hit the remaining server that is still up.

Are there any DNS or FreeSWITCH settings that would encourage the other SRV servers to pick up, and the client to try, the failed server's connection? Here are the DNS settings so far:

SRV
_sip._tcp.sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

SRV
_sip._udp.sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

SRV
sip.aws.domain.com
10 50 5060 server-0.aws.domain.com
10 50 5060 server-1.aws.domain.com

A
sip.aws.domain.com
1.2.3.4
2.3.4.5
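For context on what the client is expected to do with records like these, below is a sketch of SRV target ordering in the style of RFC 2782: lowest priority first, weighted random choice within a priority, then fall through to the next target if one is down. It models how a new request picks a server. It says nothing about rescuing a call that is already up on a crashed server, which, as far as I understand, SRV alone cannot do. The function names and the is_up callback are hypothetical.

```python
import random


def order_srv_targets(records, rng=random):
    """Order (priority, weight, port, target) tuples the way a client tries them."""
    ordered = []
    for priority in sorted({rec[0] for rec in records}):
        group = [rec for rec in records if rec[0] == priority]
        while group:
            total = sum(rec[1] for rec in group)
            pick = rng.uniform(0, total) if total > 0 else 0
            running = 0
            for index, rec in enumerate(group):
                running += rec[1]
                if pick <= running:
                    # Weighted pick: heavier records are chosen more often.
                    ordered.append(group.pop(index))
                    break
    return ordered


def first_reachable(ordered, is_up):
    """Return the first record whose target is up, or None if all are down."""
    return next((rec for rec in ordered if is_up(rec[3])), None)


RECORDS = [
    (10, 50, 5060, "server-0.aws.domain.com"),
    (10, 50, 5060, "server-1.aws.domain.com"),
]
order = order_srv_targets(RECORDS)
# Pretend server-0 has crashed: the next new request lands on server-1.
print(first_reachable(order, lambda host: host != "server-0.aws.domain.com"))
```

This matches the behaviour observed above, where hanging up and redialling reaches the surviving server.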
 