So I picked up on something recently that I wasn’t really aware of.

By default, when you run tests and failovers, SRM brings the datastores up and attaches them to your hosts with a snap-xxx prefix.

This can be annoying!

For me it was annoying simply because our PRTG monitoring solution picks up on new datastores, and once they are removed it alarms because they have suddenly vanished!

The thing is, the snap-xxx prefix is always random, so after every test you would need to clean up the alarm sensors, which got annoying after a while haha.

But the advanced SRM option was pointed out to me in the vExpert Slack channel.

https://pubs.vmware.com/srm-60/index.jsp?topic=%2Fcom.vmware.srm.admin.doc%2FGUID-E4060824-E3C2-4869-BC39-76E88E2FF9A0.html

storageProvider.fixRecoveredDatastoreNames – Force removal, upon successful completion of a recovery, of the snap-xx prefix applied to recovered datastore names. The default value is false.

[Screenshots: srm_snapxxx_remove1-1, srm_snapxxx_remove2-1]

So you set this to true in the advanced settings at each site and you are good to go: the datastore name will remain the same.

Normally you would expect to find the VMware Tools init script here:

$ sudo /etc/init.d/vmware-tools start
$ sudo /etc/init.d/vmware-tools stop
$ sudo /etc/init.d/vmware-tools restart

But newer Linux builds have it installed elsewhere:

VMware Tools init script is missing from the /etc/init.d directory on Linux virtual machines (2015685)

  • /etc/vmware-tools/services.sh start
  • /etc/vmware-tools/services.sh stop
  • /etc/vmware-tools/services.sh restart

So you have to run it kind of like you would run services.sh restart on an ESXi host.
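If you want a script to cope with both layouts, something like the rough sketch below could pick whichever file is present. The two paths are the ones from the commands above; the function names and the `sudo` prefix are just my own choices for illustration, not anything VMware ships.

```python
import os

# Candidate script locations, taken from the two sets of commands above.
CANDIDATE_SCRIPTS = (
    "/etc/init.d/vmware-tools",
    "/etc/vmware-tools/services.sh",
)


def find_tools_script(candidates=CANDIDATE_SCRIPTS):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def build_command(action, candidates=CANDIDATE_SCRIPTS):
    """Build the argument list for start/stop/restart (hypothetical helper)."""
    if action not in ("start", "stop", "restart"):
        raise ValueError(f"unsupported action: {action}")
    script = find_tools_script(candidates)
    if script is None:
        raise FileNotFoundError("no VMware Tools service script found")
    return ["sudo", script, action]
```

The returned list can be handed to `subprocess.run` as is; I kept the actual execution out of the sketch so it is safe to try on any box.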

So basically, because of the latency on the link, transfer times were pretty slow between our CentOS VMs at each site.

There have been some good articles written by people already out there:

Linux TCP Tuning

Linux Network Tuning for 2013

PING VM in France 56(84) bytes of data.
64 bytes from VM in France 56: icmp_seq=1 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=2 ttl=58 time=15.8 ms
64 bytes from VM in France 56: icmp_seq=3 ttl=58 time=15.7 ms
64 bytes from VM in France 56: icmp_seq=4 ttl=58 time=15.8 ms
64 bytes from VM in France 56: icmp_seq=5 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=6 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=7 ttl=58 time=16.1 ms
64 bytes from VM in France 56: icmp_seq=8 ttl=58 time=15.8 ms
64 bytes from VM in France 56:

ping -M do -s 1472 remoteHost worked with no issue, so it was not an MTU problem (1472 bytes of payload plus 28 bytes of ICMP and IP headers fills a standard 1500-byte MTU without fragmenting).

You basically edit the /etc/sysctl.conf file and adjust various aspects of the networking config to take the latency into account.

We tried various combinations, applying the settings at both sides of the link and then rebooting the VMs just for consistency.

We settled on:

# 3rd allow testing with buffers up to 12MB
#TCP max buffer size
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
# increase Linux autotuning TCP buffer limit 12MB
net.ipv4.tcp_rmem = 4096 87380 12582912
net.ipv4.tcp_wmem = 4096 87380 12582912
# increase the length of the processor input queue
net.core.netdev_max_backlog = 5000
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp
This seemed to give us a much better speed that was steady and acceptable for us, as the LES link is used by various other things.
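To get a feel for why the buffer sizes matter at this latency, here is a back-of-the-envelope sketch. A single TCP connection cannot move more than one window of data per round trip, so window size divided by RTT gives a ceiling on throughput. The RTT is roughly what the ping above showed; the 1 Gbit/s link speed is purely a made-up figure for illustration, and the kernel keeps part of each buffer for its own bookkeeping, so real numbers will come out lower.

```python
def max_throughput_mbit(window_bytes, rtt_ms):
    """Ceiling on single-connection TCP throughput (window / RTT) in Mbit/s."""
    rtt_s = rtt_ms / 1000.0
    return window_bytes * 8 / rtt_s / 1_000_000


def bdp_bytes(link_mbit, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_mbit * 1_000_000 / 8 * (rtt_ms / 1000.0)


RTT_MS = 15.8  # roughly the round-trip time from the ping output above

# Middle value of tcp_rmem/tcp_wmem (the starting buffer size).
print(max_throughput_mbit(87380, RTT_MS))

# The 12MB maximum that autotuning is allowed to grow to.
print(max_throughput_mbit(12582912, RTT_MS))

# Buffer needed to fill a hypothetical 1 Gbit/s link at this RTT.
print(bdp_bytes(1000, RTT_MS))
```

With these inputs the 87380-byte window tops out at around 44 Mbit/s, while the 12MB maximum leaves plenty of headroom, which lines up with the idea that the buffers, not the link, were holding the transfers back.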

So I was trying to enable some more volume replications and I came across this odd error:

[Screenshot: fc-replciationonly]

I couldn’t figure out why, all of a sudden, it would not show my iSCSI replication options. All my current replications are iSCSI and are working properly. I had a look at all the iSCSI settings/links and they all showed as up and working.

So I gave support a call and they did the same checks that I had done. They said that in previous instances of this happening, restarting the Enterprise Manager service fixed the issue.

So we did that and we were back in business!

You can restart the service in two places:

[Screenshot: fc-replciationonly_2]

Or in the Windows Services section:

[Screenshot: fc-replciationonly_1]

Once that was done, we were able to see:

[Screenshot: fc-replciationonly_3]

All was well in the world again!

I had this issue where VMware Tools would fail to install with the following error if I ran the installer normally:

The Windows Installer Service could not be accessed.

If I tried to run it interactively I got this:

[Screenshot: tools_update_error2]

After digging around and looking online, with nothing seeming to match, I had a look at the Windows Installer service and saw:

[Screenshot: tools_update_error1]

And behold, once I turned the service back on it installed fine lol.

It looks like it was turned off because on some Windows boxes it consumes all the CPU, and the only way to get round this was to disable the service on these boxes. These boxes handle lots of processing for various things, and the installer is only enabled when needed for patching, updates, etc.

I kept getting the following error:

Call “VirtualMachine.Relocate” for object “VM Name” on vCenter Server “vCenter Server Name” failed.

KB Article:

Storage vMotion migration fails with the error: The method is disabled by ‘SYMC-INCR dd-mm-yyyy hh:mm’ (2008957)

So according to the KB there are a few ways we can get round this:

To work around this issue, use one of these options:

  1. Schedule another backup
  2. Manually remove entries from the vCenter Server database
  3. Manually remove entries from the vCenter Server Appliance vPostgres database
  4. Remove and re-add the virtual machine from the inventory
  5. Remove and re-add the ESXi/ESX hosting the virtual machine from the inventory

Now 1 and 4 are by far the easiest options, the rest seem a bit drastic, even though I am sure they would work!

So I ran a one-time backup in Veeam, and after it completed I did a storage vMotion and it worked perfectly.

A strange error that I had never come across before. These VMs haven’t been backed up in years, simply because they are very easy to replace.

My guess is they were last backed up using Veeam v6.5, which was before I joined the company.

One of the first things I did was upgrade from v6.5 to v8! After chatting to a few people, it could have been pre-Veeam implementation altogether, as that’s the last time anyone remembers these VMs being backed up.

So I set up SRM with the Compellent SRAs, everything was configured as per the best practice guides.

I would run test plans and they would run fine. But on cleanup, the cleanup would show as finished in SRM, yet if you went into Enterprise Manager the test volume was still there and mapped to the cluster. SRM had unmounted and detached it from the hosts, but it appeared the SRA/EM didn’t finish the job. Interestingly, when I did an SRM test going the other way, the clear-down worked perfectly.

Anyway, I kept going and actually did a planned migration of some VMs from our protected site to the recovery site. This went off perfectly. The VMs were brought up over there without issue, we could log in, and the DNS customisation I talked about earlier had been applied. Some of the VMs were being recovered using vSphere Replication and the rest were using array-based replication.

Then came the time to re-protect. The vSphere Replication VMs re-protected fine, but the array-based volumes failed.

The error would point to a volume ID that didn’t correspond to either the source or destination volumes.

[Screenshot: srm_error]

I called up Dell/Compellent support and they asked me to try again, but this time to make sure the Enterprise/SAN recycle bins were empty and, after the planned migration, to save the restore points and re-scan the SRAs in SRM.

I did all of this and it still fell over on re-protect. What I found strange was the engineer didn’t even want to look at any logs?! To me, looking at the logs would be one of the first things anyone would do, but his response was “no need to look at the logs yet, we could end up going in circles”.

I pointed him to a Reddit post I had made, where others had a similar problem to ours. They noted that when they used EM 2015 R1 they never encountered any problems, and we were on 2015 R3. I tried to call the support engineer a few times and couldn’t get hold of him; he was either away or on a WebEx/call. I did ask for him to call me back when I phoned up… but that never happened. He did email me to say that downgrading was a good idea. I asked why, and he said it was the recommended course of action in this situation.

Since I had trouble getting hold of him on numerous occasions, I called up and asked if there was any other engineer with SRM/SRA knowledge. I was told, “Oh, the engineer assigned to your case will be back in about 10 mins and I will make sure he calls you.” Well, guess what, he never called. I got an email from him at 5pm, but when I replied a minute later I got his out-of-office and he was gone.

I mean, I understand the engineers are busy, but come on, that level of service isn’t up to par, especially since I asked for any other senior tech with SRM/SRA knowledge.

The last time I upgraded the Compellent Data Collectors and Enterprise Manager it was a pain, so I wasn’t looking forward to it at all. But if it had to be done…

I backed up the VMs, took a snapshot, downloaded the 2015 R1 version, uninstalled everything, and then did a fresh install with the older version. I took screenshots of all the key information so I could put it back in later. I had to configure the EM users again for people/myself/SRM/SRAs. I did all that and then did the same uninstall/reinstall for the remote data collector at the recovery site.

When I re-did the array pairs in SRM, it would go through the process fine but not actually show any replications.

[Screenshot: srm_error1]

I spent a while pondering this, then I realised I had come across this before! It was down to the fact that the accounts created for SRM to use with the SRA didn’t have the controllers mapped to them, so SRM would never see the actual replications! So I logged into Enterprise Manager with the SRA user account and mapped in the controllers so all the replications would be visible!

Once I got over that, I set up some Protection Groups and Recovery Plans. I tested them out and this time the clear-down was flawless both ways, and the re-protects for the array-based replications were fine too!

[Screenshot: srm_planned_migration]

All in all it came down to some kind of bug in Enterprise Manager 2015 R3!

I originally wrote this for http://vmusketeers.com/

So to apply this fix:

First thing I did is I took a Veeam backup of the VM, then I copied over the files as downloaded from:

Additional patch for security issue CVE-2015-2342 for vCenter Server 5.x on Windows (2144428)

I did an MD5 check on the downloaded files.
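If you don’t have a checksum tool handy on the server, a few lines of Python will do the MD5 check. This is just a minimal sketch using the standard hashlib module; the function names are mine, and the published checksum you compare against is whatever is listed next to the download.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Call `matches_published` with the path to the downloaded file and the checksum string from the download page; anything other than `True` means the file should be downloaded again.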

Opened Command Prompt as administrator, and navigated to the folder where I had extracted the script and ran:

cscript JMXScript5.<x>.vbs "<location_of_wrapper.conf_file>"

cscript JMXScript5.5.vbs "C:\Program Files\VMware\Infrastructure\vSphereWebClient\server\bin\service\conf\wrapper.conf"

Since I was running 5.5u3b, I already had a partial fix, so it was just a matter of running the script to fully patch it. The script stops the web service, does its business and restarts it again. It takes about a minute or so.

The script outputs a log file to the same folder as the actual script:

[Screenshot: cve-2015-2342_patch1]

I then logged into the web client and confirmed it was all working as it should, and you will see this at the command line:

[Screenshot: cve-2015-2342_patch2]

I then did exactly the same thing at my DR site too.

Important: Check for the Finish script execution message in the jmxscript.log file to confirm the execution of the script. Location of the log file is same as the folder from where the script is executed.
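Since I had to do this at two sites, the log check could be scripted. Here is a rough sketch that looks for the finish message in the log; the marker text is taken from the note above, and I am assuming the log is plain text that Python can read with its default encoding, so adjust the marker or the encoding if your log differs.

```python
def script_finished(log_path, marker="Finish script execution"):
    """Return True if any line of the log contains the marker (case-insensitive)."""
    wanted = marker.lower()
    with open(log_path, "r", errors="replace") as handle:
        return any(wanted in line.lower() for line in handle)
```

Point it at the jmxscript.log file in the folder you ran the script from; a `False` result means the message was not found and the log is worth reading by hand.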

I originally wrote this for http://vmusketeers.com/

Basically I had a recovery plan, and I had done tests/cleanups/migrations/re-protects and it all worked exactly as expected, but whenever I wanted to export the history reports, it wouldn’t do anything!

I would pick the type of report I wanted (html/doc etc), I would click generate report. It would generate it fine.

But when I clicked download…I would get nothing. I click it and click it and nothing would happen?!

I tried Chrome and Firefox. I had restarted the SRM service and the Web Client service at both sites. I had tried a different user account. I simply could not get the report to export.

But every other plan exported fine.

Then I asked in the VMware community forums and someone said there was a known bug that would stop an export if the Recovery Plan name had any special characters such as / or %.

Behold, mine had a ‘/’ in it!

After I renamed it and removed the special character, it exported every time!

SO PLEASE REMOVE ANY SPECIAL CHARACTERS FROM YOUR RECOVERY PLANS!
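If you have a long list of plans, a quick check like the sketch below can flag the names worth renaming. The only characters I know cause trouble are the two mentioned above (/ and %); the character set, function names and example plan names here are my own assumptions, so extend the set if you hit others.

```python
# The two characters named above; treat this set as an assumption, not a full list.
SUSPECT_CHARACTERS = frozenset("/%")


def suspect_characters(plan_name, suspects=SUSPECT_CHARACTERS):
    """Return the sorted list of suspect characters found in a plan name."""
    return sorted({ch for ch in plan_name if ch in suspects})


def plans_to_rename(plan_names, suspects=SUSPECT_CHARACTERS):
    """Map each offending plan name to the characters that need removing."""
    flagged = {}
    for name in plan_names:
        found = suspect_characters(name, suspects)
        if found:
            flagged[name] = found
    return flagged
```

Feed it the plan names copied out of SRM (typed in or exported however you like) and rename anything it returns.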

So I randomly started getting Error 1009 in the vSphere Web Client 5.5u3b. I had a dig around in the logs and it led me to this KB article.

So basically I did what it said:

  • Stop the Web Client Service
  • Take a snapshot and a backup of the vSphere Web Client VM (one of the perks of having it separated out)
  • On the Windows vCenter, go to C:\programdata\vmware\vSphere Web Client\SerenityDB\serenity
  • Delete all the files in the folder
  • Restart the Web Client Service
  • Check all is working
  • Clear snapshot

Not really rocket science but it cleared the issues I was having and it hasn’t come back since!
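If you ever need to repeat the folder clean-up step, it could be scripted along these lines. This is only a sketch of the "delete all the files in the folder" step: stopping and starting the service, the snapshot and the backup are still done by hand, the function name is mine, and it defaults to a dry run so nothing is deleted until you ask for it.

```python
import os
import shutil


def clear_folder(folder, dry_run=True):
    """Remove everything inside folder but keep the folder itself.

    Returns the list of paths that were (or, in a dry run, would be) removed.
    """
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        removed.append(path)
        if dry_run:
            continue
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return removed
```

Run it once with the default `dry_run=True` against the serenity folder to see what it would remove, and only pass `dry_run=False` once the service is stopped and the backup is in place.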