When I was trying to do a rescan using the SRA in SRM I came across this error:

SRA command ‘discoverDevices’ failed. Error running command on Enterprise Manager [Command: discoverDevices] [Error: StorageCenterError – Error getting ” Volume ” from SC [SN: 23,424 ] [Name: Storage Center 2 – Aberford Production ] (” IO Exception communicating with the Storage Center: No buffer space available (maximum connections reached?): connect “)]

I checked the Enterprise Manager logs for error details:

No buffer space available (maximum connections reached?)

I had a look at netstat:

[Screenshot: netstat output showing the maximum number of sockets had been reached]

As you can see from the screenshot, the maximum number of sockets had been reached.
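If you want the raw numbers rather than eyeballing the whole netstat output, something like this on the Enterprise Manager server is enough (plain PowerShell on 2008 R2 will run it):

# count the TCP sockets currently in use
(netstat -ano | Select-String "TCP").Count
# show the dynamic (ephemeral) port range the OS can hand out
netsh int ipv4 show dynamicport tcp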

After doing some digging, it appears it's a known issue on Windows Server 2008 R2 and there is a hotfix:

https://support.microsoft.com/en-us/kb/2577795

http://supportkb.riverbed.com/support/index/index?page=content&id=S23580&cat=VNE_SERVER&actp=LIST


I downloaded the hotfix and applied it to both my primary and secondary Data Collectors, as they both run 2008 R2. As always, I took a snapshot first and then applied the hotfix.
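To confirm the hotfix actually took after the reboot, a quick sketch (the VM name in the snapshot line is just an example):

# snapshot the Data Collector VM first, from a PowerCLI session
New-Snapshot -VM "EM-DataCollector01" -Name "pre-KB2577795"
# then, on the Data Collector itself after installing and rebooting, confirm the hotfix is present
Get-HotFix -Id KB2577795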

The issue has not come back since!

Configuring the alarms isn't all that easy, and it's not documented that well.

In the vExpert Slack I was chatting to Tim (@railroadmanuk) as he was trying to configure some alarms, and this is something I have been meaning to dig into more myself.

He started off and set me in the right direction, as you have to do it from the vCenter Server object: vCenter Server > Manage > Alarm Definitions.

Then you create a new alarm.

I needed to create an alarm for when a new VM is put onto a replicated datastore, and another for when a currently protected VM has been reconfigured in a way that would impact its protection status (e.g. a new VMDK on a non-replicated datastore, or a new vNIC on an unmapped port group).

I had to fiddle with the alarms a bit to get the right mix of what I wanted, as I could get the alarm to show an alert that a new VM had been discovered, but at first I couldn't get it to clear. I eventually figured it out as shown:

[Screenshot: alarm definition for a new VM discovered on a replicated datastore]

For "VM Discovered" there is no direct opposite event, so I had to try a few other options until I found something that worked well enough. Testing involved moving a test VM onto a replicated datastore, moving it off again, and also protecting the VM, to confirm the alarm cleared in both cases.

The unprotected alarm was much simpler to work out:

[Screenshot: alarm definition for a protected VM becoming unprotected]

Once they were done and tested, I set them to email their alerts to the group mailbox for someone to pick up. The emails were set to repeat every 720 minutes, which is every 12 hours.
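If you prefer, the email action can also be attached from PowerCLI. A minimal sketch, where the alarm name and mailbox address are placeholders for whatever you actually created:

# attach an email action to the custom alarm and set the repeat interval to 720 minutes
$alarm = Get-AlarmDefinition -Name "New VM on replicated datastore"
New-AlarmAction -AlarmDefinition $alarm -Email -To "dr-team@example.com" -Subject "SRM: new VM found on a replicated datastore"
Set-AlarmDefinition -AlarmDefinition $alarm -ActionRepeatMinutes 720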

 

I originally wrote this for http://vmusketeers.com/

When you use SRM (in my case SRM version 5.8.1) in combination with VR to protect your VMs, SRM will own these VMs. You will not be able to recover these protected VMs using vSphere Replication on its own as long as they are SRM protected. This became an issue for us because the SRM service would fail to start at the DR site. Working with VMware support we got it working again, but neither VMware Support nor I really knew what the root cause was.

A little SRA and SRM history

We are using SRM in combination with SRAs and had a Recovery Plan that was stuck in partial re-protect mode, and no amount of fiddling would get it sorted. To bypass this SRM protection, VMware support provided me with a tool that interrogates the SRM database and removes all references to a specific Recovery Plan/Protection Group. This tool worked well and removed all references, but as soon as I used it, the SRM service at the DR site would not start, and the SRM logs were of no help in figuring out what happened. We reverted the SQL DB from backup at both sites and re-did it…again and again and again. Every time we had the exact same result, at which point VMware said all that was left was a fresh install from scratch (urghhh).

Oddly enough when I tried it the next day it did work! The services started and no matter what I did, no matter how many times I tried to replicate the issue, I simply couldn’t.

VMware still cannot explain why this happened and neither can I. As great as it is that it's working, not knowing the root cause troubles me quite a bit!

A VR solution

So, to stop this issue from happening again, I moved away from array-based replication to VR where possible, because I didn't want to rely on the SRA (everyone I speak to, regardless of vendor, has issues with their SRA). These days the recommendation seems to be that if you can use VR, you should, as it just works.

By using VR we knew that we could use it with SRM, and also use it without SRM to recover VMs one by one on its own if needed. Worst case, we could add the replicated VMs to the inventory ourselves.

Anyway, I noticed that when I tried to recover a VM on its own that was also protected by SRM…I simply couldn't.

This puzzled me for a while, as I had always assumed that VR worked with SRM but could always be run independently regardless. I spoke to people using 6.0 and they said the option to do it manually was there, but for some reason I just couldn't, no matter what I did.

[Screenshot: VR recovery option unavailable for an SRM-protected VM]

After much digging and asking around, no one really knew, which was puzzling. I knew that if I removed the SRM protection I could recover the VM using VR manually just fine.

So after digging around in the SRM 5.8 admin guide on page 114, I found this:

http://pubs.vmware.com/srm-58/topic/com.vmware.ICbase/PDF/srm-admin-5-8.pdf page 114

Change vSphere Replication Settings

You can adjust global settings to change how Site Recovery Manager interacts with vSphere Replication. 

  1.  In the vSphere Web Client, click Site Recovery > Sites, and select a site.
  2. On the Manage tab, click Advanced Settings.
  3. Click vSphere Replication.
  4. Click Edit to modify the vSphere Replication settings.

Option: allowOtherSolutionTagInRecovery

Description: Allow vSphere Replication to recover virtual machines that are included in Site Recovery Manager recovery plans independently of Site Recovery Manager. The default value is false.

If you configure vSphere Replication on a virtual machine and include the virtual machine in a Site Recovery Manager recovery plan, you cannot recover the virtual machine by using vSphere Replication independently of Site Recovery Manager. To allow vSphere Replication to recover virtual machines independently of Site Recovery Manager, select the allowOtherSolutionTagInRecovery check box.

Now this seems to be exactly what I was after…or so you would think, heh.

After changing the setting to “true” at both sites for SRM, the issue still remained. I restarted the SRM/VC/Web service at both sites and I was still unable to manually recover VR VMs that were also protected by SRM.

Next I deleted the Protection Group for my test VMs and then they became tagged by VR again and I could recover manually, but adding them back to a Protection Group tagged them as SRM and I was unable to recover them manually again.

I already had a VMware Support Case open, from when we were dealing with the SRM service dying at our DR site. The engineer seemed to think there was some kind of glitch as everyone in support assumed that you could always recover VR replicated VMs manually in the GUI regardless of whether they were part of an SRM Protection Group.

I did some more digging and spoke to @Mike_Laverick, who wrote the book on SRM back at 5.0. He hasn't worked with SRM in a while, but he put me in touch with GS Khalsa (@gurusimran), who started to look into it and brought it up with the engineering team at VMware.

I decided to have a fiddle with the SRM HOL, which is currently at 6.1. On first look it appeared it would let you fail over VR VMs manually even if they were tagged by SRM.

 

[Screenshot: the VR manual recovery option appearing to be available]

But when you try, the first screen stops you, saying they are tagged by SRM.

[Screenshot: the recovery wizard blocking VMs that are tagged by SRM]

The advanced setting made no difference either way! So it was exactly the same behaviour I was seeing in 5.8.1. From what I can tell, everyone assumes that because they can see the red Play button in VR, it will let them recover the VM regardless of its SRM status!

After discussing it with GS and with him speaking to the engineering team, it actually looks like the documentation is incorrect, and as such he has opened an internal case to get the description for the advanced setting changed.

From what he has been told, it is more to do with how SRM interacts with other third-party solutions, so the description should read something along the lines of “To allow SRM to recover VMs whose replications are managed by other solutions, check this box.”

 

I originally wrote this for http://vmusketeers.com/

So you can use vRDMs with vSphere Replication:

vSphere Replication and Virtual Raw Device Mappings

There aren't many use cases for it, and it isn't documented very well.

So I have been pondering over something: vSphere Replication supports VMs that have vRDMs, but not pRDMs. ESXi 5.5 now supports vRDMs up to 62 TB. What VR does is bring the vRDM up as a VMDK file at the DR site.

I am throwing this out there as a possible option for replicating our VMs that have RDMs attached, since we have a few VMs with RDMs that need to be replicated too.

“If you wish to maintain the use of a virtual RDM at the target location, it is possible to create a virtual RDM at the target location using the same size LUN, unregister (not delete from disk) the virtual machine to which the virtual RDM is attached from the vCenter Server inventory, and then use that virtual RDM as a seed for replication. However, this process is a bit more cumbersome – especially compared to what we just discussed above.”

So it's possible even to use a vRDM at the target site as a seed, meaning in theory we could ditch array-based replication altogether.
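For reference, attaching the replicated LUN to the seed VM as a virtual-mode RDM can also be done from PowerCLI. A minimal sketch, where the VM name, datastore and naa ID are all placeholders:

# attach the LUN as a virtual-mode RDM to the seed VM at the DR site
$vm = Get-VM "DR-SeedVM"
New-HardDisk -VM $vm -DiskType RawVirtual -DeviceName "/vmfs/devices/disks/naa.600000000000000000000000000000aa" -Datastore (Get-Datastore "DR-Datastore01")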

I always thought that if you used vRDMs it just brought them up as a VMDK at the DR site and that was that…clearly I was wrong!

We did ponder moving to large VMDKs, as over 2 TB is now supported, but:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058287

You cannot hot-extend a virtual disk if the capacity after extending the disk is equal to or greater than 2 TB. Only offline extension of GPT-partitioned disks beyond 2 TB is possible.

We need to be able to extend these large disks on the fly without downtime, so that rules out large VMDKs.
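If you want to see that limitation for yourself, a quick PowerCLI sketch (the VM name and sizes are made up):

# hot-extending a disk to 2 TB or more fails while the VM is powered on (KB 2058287)
$disk = Get-HardDisk -VM "TestVM" | Select-Object -First 1
Set-HardDisk -HardDisk $disk -CapacityGB 2048 -Confirm:$false
# the same extend only works offline, i.e. with the VM powered off and a GPT-partitioned disk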

Virtual Compatibility mode:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007021

To safely expand the RDM:

1. Expand the RDM LUN from the SAN side.
2. Perform a rescan on the ESXi host and verify the new LUN size is observed.

Recreate the RDM mapping to update the mapped disk size using one of these methods:

3a. Utilise Storage vMotion to migrate the Virtual RDM disk’s pointer file (vSphere 4.0 and later).

Or

3b. Remove the RDM file from the Virtual Machine and delete from disk. Power off the virtual machine, note the scsiX:Y position of the RDM in VM Settings. Navigate to VM Settings > Add > Hard Disk > RDM, select the scsiX:Y position that the RDM was using before and then power on the virtual machine.

Perform a re-scan from the guest operating system.

I've tried all the methods; a rescan plus Storage vMotion (3a) is the easiest and doesn't involve any downtime.

You can then extend it in the OS and Storage vMotion it back to its original datastore. A rough PowerCLI version of that follows.
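This is just a sketch of option 3a; the cluster, VM and datastore names are placeholders:

# rescan the HBAs so the hosts see the new LUN size
Get-VMHost -Location "ProdCluster" | Get-VMHostStorage -RescanAllHba -RescanVmfs | Out-Null
# Storage vMotion just the vRDM pointer file to another datastore...
$rdm = Get-HardDisk -VM "TestVM" -DiskType RawVirtual
Move-HardDisk -HardDisk $rdm -Datastore (Get-Datastore "TempDatastore") -Confirm:$false
# ...and back again, re-reading the disk object in between
$rdm = Get-HardDisk -VM "TestVM" -DiskType RawVirtual
Move-HardDisk -HardDisk $rdm -Datastore (Get-Datastore "OriginalDatastore") -Confirm:$false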

I attached a vRDM to my test VM; this stopped VR replication, as it detected a new disk that needed to be configured.

  • I dumped some random text files on the RDM in Microsoft Windows
  • I then replicated the RDM volume via the SAN
  • Killed the replication and attached that volume to the DR cluster
  • I added the seed VM to the inventory and attached the replicated volume as a vRDM using the same SCSI ID
  • I then reconfigured replication and it picked up the seed

It then did a full sync comparing the data.

Even if there is no data to replicate, a full sync can take an age, as it does a checksum comparison for integrity. For around 500 GB it takes on average 2 hours per VMDK, which works out at roughly 70 MB/s.

So for very large vRDMs, a sync could run into many hours even when using seeds. Also bear in mind that when you resize the disks it will do a full sync again: for any VR-replicated VM, to expand a disk you have to pause replication, resize the disk at both ends, and then restart replication. So this could leave you without protection for a long time.

This for me was a deal breaker really, but I kept on testing.

I then extended the vRDM at the protected site by 10 GB, did the rescan and Storage vMotioned the VM, and the new size showed up on the host.

I paused replication and then did the resize at the DR site. I did this by manually adding the seed to the inventory, removing the vRDM from the VM at the DR site, and adding it back in using the same SCSI ID (option 3b from the list earlier). But when configuring it back up for replication I got a UUID mismatch, so I edited the VMDK descriptor file to match the source, and then VR would re-sync.

KB on modifying UUIDs for seeds:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2041657

There is no documentation anywhere about how to deal with resizing a vRDM that is protected by vSphere Replication, but I am assuming the way I did it is correct, or as close to correct as it gets, haha.

With pRDMs it will give you a warning and let you configure the rest of the VM, but on completion it will say “A general system error occurred: No such device”.

I was hoping it would still replicate the rest of the VM and ignore the pRDM #sadtimes ha

Now this was more for experimentation on my end, so I could evaluate all the options and see which worked best for me, as I have encountered my fair share of issues using the Dell Compellent SRA, whereas vSphere Replication on the whole has been pretty flawless.

Run this script:

https://www.vmadmin.co.uk/resources/48-vspherepowercli/251-powerclinetworkoutputcsv

That gets you all the MAC addresses for the hosts in the cluster.
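If you would rather not grab the script, a minimal PowerCLI equivalent looks something like this (the output path is arbitrary):

# dump every host network adapter MAC in the connected vCenter to a CSV
Get-VMHost | ForEach-Object {
    $esx = $_
    Get-VMHostNetworkAdapter -VMHost $esx | Select-Object @{N='Host';E={$esx.Name}}, Name, Mac
} | Export-Csv C:\temp\host-macs.csv -NoTypeInformation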

Now follow what is in this KB:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051

vmkfstools -D /vmfs/volumes/UUID/VMDIR/LOCKEDFILE.xxx

Run that against the VMDK file you can't delete and you'll get an output like:

Lock [type 10c00001 offset 233148416 v 1069, hb offset 4075520

gen 57, mode 1, owner 570d1952-3933cca0-906d-bc305bf57cf4 mtime 8947925

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 552, 50>, gen 1050, links 1, type reg, flags 0, uid 0, gid 0, mode 600

Now match the MAC address embedded in the owner field (its last 12 hex characters, bc305bf57cf4 in this example) against the list in the CSV output you got earlier.
That will tell you which host holds the lock. Log onto that host and see if there is a VM in an unknown state, and if there is…delete it.
If not, browse the datastore from this host and delete the locked file, which should now work just fine. A quick PowerCLI way to do the MAC match is below.
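A minimal sketch of that match, using the CSV from earlier (the path is whatever you exported to):

# bc305bf57cf4 is the last 12 hex characters of the owner field in the vmkfstools -D output above
$ownerMac = 'bc305bf57cf4'
Import-Csv C:\temp\host-macs.csv | Where-Object { ($_.Mac -replace '[:\-]', '') -eq $ownerMac }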

So I picked up on something recently that I wasn't really aware of.

By default, when you are doing tests and failovers, SRM will bring the datastores up and attach them to your hosts with a snap-xxx prefix.

This can be annoying!

For me it was annoying simply because our PRTG monitoring solution picks up on new datastores, and once they are removed it alarms because they have suddenly vanished!

The thing is, the snap-xxx prefix is always random, so after every test you would need to clean up the alarm sensors, which got annoying after a while, haha.

But this advanced SRM option was pointed out to me in the vExpert Slack channel.

https://pubs.vmware.com/srm-60/index.jsp?topic=%2Fcom.vmware.srm.admin.doc%2FGUID-E4060824-E3C2-4869-BC39-76E88E2FF9A0.html

storageProvider.fixRecoveredDatastoreNames – Force removal, upon successful completion of a recovery, of the snap-xx prefix applied to recovered datastore names. The default value is false.

 

[Screenshots: setting storageProvider.fixRecoveredDatastoreNames to true in the SRM advanced settings at each site]

So you set this to true in the advanced settings at each site and you are good to go; the datastore name will remain the same.
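If you already have recovered datastores carrying the prefix, a rough PowerCLI clean-up sketch (assuming the usual snap-xxxxxxxx-originalname format):

# strip the snap-xxxxxxxx- prefix from recovered datastore names
Get-Datastore -Name 'snap-*' | ForEach-Object {
    Set-Datastore -Datastore $_ -Name ($_.Name -replace '^snap-[0-9a-f]+-', '') -Confirm:$false
}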

Normally you would expect it to be:

$ sudo /etc/init.d/vmware-tools start
$ sudo /etc/init.d/vmware-tools stop
$ sudo /etc/init.d/vmware-tools restart

But newer Linux builds have it installed elsewhere:

VMware Tools init script is missing from the /etc/init.d directory on Linux virtual machines (2015685)

  • /etc/vmware-tools/services.sh start
  • /etc/vmware-tools/services.sh stop
  • /etc/vmware-tools/services.sh restart

So you have to run it kind of like you would run services.sh restart on an ESXi host.

 

So basically, because of the latency on the link, transfer speeds were pretty slow between our CentOS VMs at each site.

There have been some good articles written by people already out there:

Linux TCP Tuning

Linux Network Tuning for 2013

PING VM in France 56(84) bytes of data.
64 bytes from VM in France 56: icmp_seq=1 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=2 ttl=58 time=15.8 ms
64 bytes from VM in France 56: icmp_seq=3 ttl=58 time=15.7 ms
64 bytes from VM in France 56: icmp_seq=4 ttl=58 time=15.8 ms
64 bytes from VM in France 56: icmp_seq=5 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=6 ttl=58 time=15.9 ms
64 bytes from VM in France 56: icmp_seq=7 ttl=58 time=16.1 ms
64 bytes from VM in France 56: icmp_seq=8 ttl=58 time=15.8 ms
ping -M do -s 1472 remoteHost – worked with no issue, so it's not an MTU problem.

You basically edit the /etc/sysctl.conf file and adjust various aspects of the networking config to take the latency into account and tune it accordingly.

We tried various combinations, applying the settings at both sides of the link (a sysctl -p reloads /etc/sysctl.conf without a reboot) and then rebooting the VMs anyway just for consistency.

We settled with:

# 3rd allow testing with buffers up to 12MB
#TCP max buffer size
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
# increase Linux autotuning TCP buffer limit 12MB
net.ipv4.tcp_rmem = 4096 87380 12582912
net.ipv4.tcp_wmem = 4096 87380 12582912
# increase the length of the processor input queue
net.core.netdev_max_backlog = 5000
# recommended default congestion control is htcp
net.ipv4.tcp_congestion_control=htcp

This seemed to give us a much better speed that was steady and acceptable for us, as the LES link is used by various other things.

So I was trying to enable some more volume replications and I came across this odd error:

[Screenshot: Enterprise Manager only offering FC replication options]

I couldn't figure out why all of a sudden it would not show my iSCSI replication options. All my current replications are iSCSI and are working properly. I had a look at all the iSCSI settings/links and they were all up and working.

So I gave support a call; they did the same checks that I had done and said that in previous instances of this happening, restarting the Enterprise Manager service fixed the issue.

So we did that and we were back in business!

You can restart the service in two places:

[Screenshot: restarting the service from within Enterprise Manager]

Or in the Windows Services section:

[Screenshot: restarting the service from the Windows Services console]
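If you would rather do it from PowerShell on the Data Collector, something along these lines works (the display name filter is a guess at what your Enterprise Manager service is actually called, so check it first):

# find and restart the Enterprise Manager Data Collector service
Get-Service -DisplayName '*Enterprise Manager*' | Restart-Service -Verbose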

Once that was done, we were able to see:

[Screenshot: iSCSI replication options showing up again]

 

All was well in the world again!

I had this issue where VMware tools would fail to install with:

“The Windows Installer Service could not be accessed” if I ran the installer normally.

If I tried to run it interactively I got this:

[Screenshot: the VMware Tools installer error when run interactively]

After digging around and looking online, with nothing seeming to match, I had a look at the Windows Installer service and saw:

[Screenshot: the Windows Installer service turned off]

And behold, once I turned the service back on, it installed fine, lol.

It looks like it was turned off because on some Windows boxes it consumes all the CPU, and the only way to get around this was to disable the service on those boxes. They handle lots of processing for various things, and the installer is only enabled when needed for patching, updates, etc.
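For future reference, flipping the Windows Installer service back on for a patch window is quick from PowerShell (msiserver is the service's short name):

# check the Windows Installer service, re-enable it and start it for the install
Get-Service msiserver
Set-Service msiserver -StartupType Manual
Start-Service msiserver
# afterwards it can be disabled again on these boxes:
# Stop-Service msiserver; Set-Service msiserver -StartupType Disabled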