What type of HA is almost everyone doing?

Status
Not open for further replies.

djacob

Member
Oct 31, 2016
Bensalem PA , USA
Hey All,

I'm trying to design the system and figure out what kind of HA to do. I wanted to do domain-based load balancing, but from what I see here no one is really doing it anymore. Is there a reason why?

Is it the presence issue?

Does it even work anymore?

What are your network layouts?

I see that the member section of Fusion has some info on domain-based load balancing, but nothing up to date and nothing for the Kamailio setup. I asked Mark about it and he said it's not really something he supports since it's not Fusion. I can understand that; it's an add-on to his platform.

I have made the script work with Kam 5.4 and will be testing presence tomorrow. I only have 2 phones and 2 Fusion boxes up.

Just want to see if anyone is doing it anymore or not.

Thanks
Dave
 

DigitalDaz

Administrator
Staff member
Sep 29, 2016
Hi Dave,

No reason to believe Kamailio wouldn't scale it out.

I'm personally a much smaller provider and with hardware being relatively cheap these days, I just do not find the need for Kamailio.

I just use pairs of servers and route53 to switch over.
 

pbz

New Member
Feb 10, 2022
DigitalDaz said: "I just use pairs of servers and route53 to switch over."

Do you give each tenant/domain its own subdomain, for example t1.domain.com, t2.domain.com, or do you use multi-tenant setups with subdomains like west1.domain.com, west2.domain.com?

Also, how fast do Route53 DNS changes take effect? I find that with DNS, even if you set the TTL very low, you have to worry about other DNS servers caching the data for much longer than your TTL. When I have made DNS changes to live servers in the past, there was a painful propagation period of a few hours. That seems like a problem if you are trying to use DNS for failover.
 

DigitalDaz

I use the t1.domain.com, t2.domain.com scenario.

I set all my TTLs to 120 and Route53 just uses A records. If the primary fails, the primary A record is removed. I have never had any problem with failover DNS-wise.
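
For anyone wanting to put a rough number on this, here is a minimal sketch of the timing, not a description of the setup above. It assumes a health check that must fail several times in a row before the record is pulled, and resolvers that honor the TTL. The 30-second interval and the threshold of 3 are illustrative assumptions; only the 120-second TTL comes from this post. Resolvers that ignore TTLs are not modelled.

```python
def worst_case_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Rough upper bound on how long clients may keep using the dead address.

    Detection time: the check must fail `failure_threshold` times in a row,
    one attempt every `check_interval_s` seconds.
    Cache time: a resolver that fetched the record just before it was
    removed may serve it for up to `ttl_s` more seconds.
    """
    detection = check_interval_s * failure_threshold
    return detection + ttl_s


# Assumed check settings (30 s interval, 3 failures) with the 120 s TTL above.
print(worst_case_failover_seconds(30, 3, 120))
```

Under those assumptions the bound comes out at a few minutes, which is the outage window to weigh against the cost of a fancier setup.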
 

pbz

Are you using the Route53 health checks for automatic failover or are you doing it manually?
 

OhSeeGee

New Member
Jan 15, 2022
Wishing I was in Jamaica
I think people have different definitions of HA. Sometimes HA is confused with load balancing. To me HA means one box taking over telephony services when the other box fails. But the devil is in the details.

There are lots of simple tests you can do with scripting (check process running, OS alive, external ping, Route 53), though when you test enough stuff that script gets pretty big. The bigger problem is when FreeSWITCH fails but the box/process/OS are still there ticking away. A great example is the OS running out of file handles: calls may not bridge anymore, but to external checks everything looks OK. (That's also why generic heartbeat + cluster tools aren't terribly useful for telephony-specific outages.)
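
As an illustration of the file handle case, here is a minimal sketch of one such check, assuming a Linux box where `/proc/sys/fs/file-nr` reports three numbers (allocated handles, unused handles, system maximum). The 90% threshold and the function names are made up for the example. Note this reads the system-wide count only; a per-process limit on the FreeSWITCH process could be hit well before this fires, so it is one signal among many, not a complete health check.

```python
from pathlib import Path

FILE_NR = Path("/proc/sys/fs/file-nr")  # Linux only


def parse_file_nr(text):
    """Return (allocated, maximum) from the contents of file-nr."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def handles_nearly_exhausted(text, threshold=0.9):
    """True when allocated handles reach `threshold` of the system maximum.

    The 0.9 default is an arbitrary value chosen for illustration.
    """
    allocated, maximum = parse_file_nr(text)
    return maximum > 0 and allocated / maximum >= threshold


if __name__ == "__main__":
    if FILE_NR.exists():
        print(handles_nearly_exhausted(FILE_NR.read_text()))
```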

IMHO, HA means serious health checking. Next, intelligent negotiating between the 2 boxes should they both come up, or alternate failures, etc. I see STONITH was popular for a while but I don't agree with it. If this is for a small install then this may not matter, but preventing one dying PBX from corrupting the other is a big deal in large call centers. Keeping the boxes in sync without the dying one corrupting the healthy one is a tough nut to crack.

So Route 53 health checks are a great start. You really have to figure out the cost/impact of an outage, and scale your solution to meet the needs of the customer. For large/critical call centers commercial solutions may be appropriate; for a mom & pop shop (or home use) some DIY scripting or Route 53 health checks are probably ample.
 

DigitalDaz

For me, Route53 is perfectly adequate. I also have other health checks going on, with Nagios doing SIP OPTIONS pings, disk space tests, etc.
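
For readers who have not seen one, a SIP OPTIONS ping can be sketched roughly as below. This is not the Nagios plugin mentioned above, just a bare-bones illustration over UDP using the Python standard library. The function names, the `monitor` user and the 2-second timeout are assumptions for the example. Any SIP response at all counts as "alive" here, and a reply only proves the SIP stack is answering, not that calls will bridge.

```python
import socket
import uuid


def build_options(host, port=5060, local_ip="127.0.0.1", local_port=5060):
    """Build a minimal SIP OPTIONS request as bytes."""
    lines = [
        f"OPTIONS sip:{host}:{port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch=z9hG4bK{uuid.uuid4().hex}",
        "Max-Forwards: 70",
        f"From: <sip:monitor@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{host}:{port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode()


def parse_status(reply):
    """Return the numeric status code of a SIP response, or None."""
    first = reply.split(b"\r\n", 1)[0].decode(errors="replace")
    parts = first.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


def sip_alive(host, port=5060, timeout=2.0):
    """Send one OPTIONS request; True if any SIP response comes back."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((host, port))
            local_ip, local_port = sock.getsockname()
            sock.send(build_options(host, port, local_ip, local_port))
            reply = sock.recv(4096)
    except OSError:  # timeout, refused, DNS failure, etc.
        return False
    return parse_status(reply) is not None
```

Calling `sip_alive("t1.domain.com")` from a monitoring box would then give a yes/no answer that can be fed into whatever does the alerting.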

It has served me well for a few years now. When I do maintenance I usually get a very smooth failover. I may have to use iptables to block ping access for a few stragglers, but that's been the only problem, and the same goes for failing back over.

I think that far more important is reliable hardware and datacenters.
 

OhSeeGee

I worked with an ITSP who built their own HA software (running on hundreds of clusters). It worked OK, but the constant effort to upgrade, maintain, fix, etc. made it really uneconomical. And every time their HA software failed they updated it for the new "use case". When you run 500-1000 PBXs you discover all the interesting ways in which telephony services can fail. And you see why simplistic solutions don't cover all the possible scenarios. When you have a legal requirement to deliver phone services (in case of emergency etc.) I think the penalties can be pretty big too. Their hardware was excellent (and they even had a lot of multi-cloud clusters), but node failure was rarely due to hardware failure.

Eventually they gave up and bought a commercial product. But it has to be worthwhile for your case. If this is your own cluster, or phones are not critical (you're not losing a thousand dollars a minute with an outage), I think DIY is a reasonable approach. It's always hard to spend $ on an add-on for a FOSS product (at least in principle). But business realities kick in, and for the ITSP the right decision just hit them in the wallet.

I too avoid commercial products if FOSS does the job. I resisted MS Office for a long time and tried to convince our own company that LibreOffice was good enough. But I have to admit that the accumulation of little technical issues caused by the free option was costing us more than just buying MS Office. I'm actually a champion of FOSS, and no fan of certain big tech companies, but I swallow the pill and hand over the money when necessary.
 