This was one I hadn’t seen in a while to be fair, and caught me a bit off guard, but it is well documented.

Cloned Windows 2008 R2 virtual machine fails to boot with the error: autochk program not found – skipping AUTOCHECK (2004505)

What was interesting for me was that this happened on ESXi 5.5 hosts, which the KB article doesn’t mention, but the VM hardware level was v7, so my guess is that’s how it’s correlated.

So the KB mentions doing this:

If these steps do not resolve the issue, try this alternate workaround:
  1. Power down the source virtual machine.
  2. Boot the virtual machine using the Windows Server 2008 R2 .iso file.
  3. In the Installation Wizard, select Repair your Computer. For more information, see the Microsoft Knowledge Base article 2261423.

    Note: The preceding link was correct as of August 18, 2011. If you find the link is broken, provide feedback and a VMware employee will update the link.

     

  4. Select Command Prompt.
  5. Run these commands in the specified order:

    diskpart
    list volume
    select volume 1
    attributes volume
    attributes volume clear nodefaultdriveletter

  6. Restart the virtual machine after removing the mounted .iso file.
  7. Clone the virtual machine again.

Now the VM I want to clone is up and running and I don’t want to take it down, so I did the following:

  • I created a clone of the VM with no customisation
  • Mounted the ISO image on the newly cloned VM
  • Booted it up and ran diskpart exactly as described
  • Then did a clone of the clone, this time using the customisation script to sort out the SIDs etc., and it worked perfectly.

I then just binned off the interim clone!
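
For what it’s worth, the whole two-step clone could be scripted in PowerCLI. Here’s a rough sketch; the VM, host, datastore and customisation spec names are placeholders, not what I actually used:

# Step 1: interim clone, no customisation
$src = Get-VM -Name "SourceVM"
New-VM -Name "SourceVM-interim" -VM $src -VMHost (Get-VMHost -Name "esxi01") -Datastore (Get-Datastore -Name "DS01")
# (boot the interim clone from the 2008 R2 ISO and run the diskpart fix here)

# Step 2: clone of the clone, this time applying the customisation spec
$spec = Get-OSCustomizationSpec -Name "Win2008R2-Spec"
New-VM -Name "FinalVM" -VM (Get-VM -Name "SourceVM-interim") -VMHost (Get-VMHost -Name "esxi01") -Datastore (Get-Datastore -Name "DS01") -OSCustomizationSpec $spec

# Step 3: bin off the interim clone
Remove-VM -VM (Get-VM -Name "SourceVM-interim") -DeletePermanently -Confirm:$false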

So I was looking through a datastore and was Storage vMotioning a few VMs and it was taking ages. After digging around I noticed that the datastore was mapped across 2 extents.

extenst_3 extenst_2 extenst_1

extenst_3 extenst_2

There should be no reason for this, especially when using VMFS5. After digging around in Compellent, looking at the LUN IDs and the naa IDs, I found out that the two extents were on different Compellent SANs! I have no idea how this came about, but the only way I know of to fix it is to Storage vMotion everything off it to a new volume/datastore, unmount the old datastore and detach the extents.
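
As an aside, you can also see which devices are backing a datastore from PowerCLI, which is how I’d cross-check the naa IDs against the array. A hedged sketch, with a placeholder datastore name:

$ds = Get-Datastore -Name "ProblemDS"
$ds.ExtensionData.Info.Vmfs.Extent | Select-Object DiskName, Partition
# DiskName is the naa.xxxx identifier of each extent, which you can then
# match up against the LUNs on the Compellent side.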

extent3 extent6

Somebody must have done that in error when trying to extend datastore space. The second extent was labelled as belonging to a totally different area/volume, so how exactly it came about I have no idea.

But it is my job to fix it!

Also, VMware did a good blog post on extents and the many misconceptions about them:

https://blogs.vmware.com/vsphere/2012/02/vmfs-extents-are-they-bad-or-simply-misunderstood.html

Misconception 2 is interesting, as I was discussing it with someone recently too; I have heard it a few times.

Nice to get it cleared up.

So we basically have an archive enclosure with 40TB usable, made up of 4TB 7k disks. The problem was it was getting full.

Now if you know how Compellent works, you will know all writes come in at RAID10/RAID10-DM.

Since this archive enclosure is classed as its own pool, and these 7k disks are the only disks that are part of that pool, it is classed as Tier 1.

So writes come in at RAID10-DM (Dual Mirror). DM is used by Compellent when you use large disks; it helps to mitigate the risk of using larger disks while still allowing the use of RAID10. Normal RAID10 has a write penalty of 2 because data is written twice; with Dual Mirror there is a penalty of 3 because data is written three times.

As you can see from the screen shot below:

raid_troubles

It was pretty full. The issue with RAID10-DM is that it consumes a load of space: as a rough back-of-the-envelope figure (my numbers, ignoring Compellent page overheads), every 1TB of data held at RAID10-DM costs about 3TB of raw disk, whereas the same 1TB at RAID6-10 (8 data + 2 parity) costs roughly 1.25TB. Since this is an archive tier, write speed is of little importance, so I decided to set the volumes on this set of disks to use just RAID6-10. Once again, with Compellent and large disks you only have the option of RAID6-10, as it helps mitigate the issue of the larger disks failing and their long rebuild times.

To do this you need to adjust the storage profile used by these volumes, so I went into Enterprise Manager and created a new storage profile:

raid_troubles5

and added the volumes to it, then all you can do is wait. Data Progression, which runs every day 7pm-7am by default, will then start moving the data out of RAID10-DM into RAID6-10. This will take some time, as you can see from the screenshots below:

Start

raid_troubles2

Midway

raid_troubles3_midway

End

raid_troubles4

If you want the volumes to take on the new profile right away, you have to do a copy/mirror/migrate (CMM), so that as the data is moved to the new volume it is put into the correct storage layer/profile.

I did consider doing a migrate (as I have used it before). Migrate is a great feature: basically you create a new volume that’s the same size as the current one, then select Migrate on the original volume, select the new volume you want to migrate to, and Compellent on the back end will move the data to the new volume, incorporating the new volume’s storage profile.

This is totally transparent to the front-end hosts/servers etc., and when complete the new volume will inherit all the mappings, LUN IDs etc., and no one should know the difference. Obviously during this process there will be some latency, there is no way round this, but apart from that it’s totally transparent.

The reason I couldn’t do this was that space was already limited. Now Compellent should adjust the RAID6 layers to increase them while reducing the size of the RAID10-DM allocation, as and when needed.

“Yes you can change the Storage Profile of a volume at any point in time while it’s online. Any new write or a change to an existing block will take on the new write characteristics immediately. The remainder of the volume will start to change it’s profile during the next Data Progression cycle. (if you want the whole thing to take on the new characteristics then you need to do a CMM, copy mirror migrate). The only thing to be concerned about is having enough capacity of the new tier and IO ability in that tier to take on the workload.”

As you can see from the screenshots, the RAID10-DM disk space is still allocated, and I would rather that disk space be allocated to RAID6-10, giving much better use of the available disks. This is something only Compellent Co-Pilot Support can do, as they have to do trimming on the back end.

What they do is de-allocate the space from RAID10-DM and mark it as free, so the controller can allocate it to RAID6-10 as needed. Since the Storage Profiles are set to use RAID6, as RAID6 gets full it will dynamically expand RAID6-10 to accommodate.

So I was tasked with cloning a VM and I got this error:
clone-error2
I then came across this:
Cloning a virtual machine and removing the virtual Distributed Switch backed network adapter fails in VMware vCenter Server 5.1 with the error: The resource ‘X’ is in use (2064681)
The thing is I am using 5.5u3b, but the VDS was still set to 5.1.
It mentioned that it was a VDS issue, so I had a look at the source VM and looked at the port it’s using:
clone-error1
 
As you can see it’s 10800, which matches the error.
 
Obviously you can’t clone it to use the same port as the source, as it’s already in use; clearly this was a glitch. Now the KB says it impacts 5.1 but I am on 5.5, however this VM was using 5.1 hardware and 5.1 tools, and the VDS was 5.1 too.
So anyway, the best course of action was to remove the NIC during customisation and add a new NIC back in at the same time, into the same port groups etc.; this will assign it a new port.
 
You could always click “switch to advanced settings” and put in a port number that you know is not in use (check VDS > Ports for what’s free), but it’s just simpler to remove and add a new vNIC at the same time.
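If you wanted to script the remove-and-re-add rather than doing it in the wizard, a rough PowerCLI sketch would look something like this (it needs a PowerCLI version with the VDS cmdlets, and the VM, adapter and port group names are placeholders):

$vm  = Get-VM -Name "ClonedVM"
$nic = Get-NetworkAdapter -VM $vm -Name "Network adapter 1"
$pg  = Get-VDPortgroup -Name "Prod-VLAN100"    # same port group as before
Remove-NetworkAdapter -NetworkAdapter $nic -Confirm:$false
New-NetworkAdapter -VM $vm -Portgroup $pg -Type Vmxnet3 -StartConnected
# the new vNIC gets allocated a free dvPort rather than the conflicting one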
Since this happened, the VDS has been upgraded to 5.5 and I haven’t come across this issue since.

Ah yes the oldie but goldie that is resizing replicated disks.

vsphere_replication_disk

So the way round this is covered in various places:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2042790

So what I did was:

  • Pause replication
  • Log onto the host the VM is replicated to at the DR site
  • I went to its datastore and renamed the folder with a -bak extension. You have to do this so that when you disable replication it does not delete the actual files at the DR site. Otherwise you’d have to start replication from the beginning again, as you would have lost your seed!
  • You then disable replication. You will see the following error, which is totally normal, because remember we renamed the folder at the DR site to stop it from being removed:
vsphere_replication_disk2

Once you have done that, you can increase the size of the disk at the Protected Site like you normally would, by editing the VM settings. Then, the key part: you log in to the host at the DR site, navigate to the datastore where the renamed folder is, and run:

vmkfstools -X size vmdk

For example:

vmkfstools -X 50G Test-VM.vmdk

So in my case the disk was originally 550GB and had been extended by 50GB to 600GB, so what I had to do here was increase the size of the vmdk at the DR site to 600GB:

vmkfstools -X 600G servername_1.vmdk

vsphere_replication_disk1

Then once that has been done you rename the folder at the DR site back to what it was originally called.

Then you reconfigure replication at the source and point it to the folder at the DR site; you will then be asked if you want to use it as a seed, so click yes.

Then what it will do is a compare, and just send the differences over the link, saving you a chunk of bandwidth!

It’s a shame the process of extending a vmdk of a VM that is protected by vSphere Replication is that complex.

You would hope that after all this time, it would have been simplified, but hey such is life!
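
As an aside, the protected-site resize itself doesn’t have to be done through the GUI; a PowerCLI one-liner does the same thing. A hedged sketch with placeholder VM and disk names (the DR-side copy still needs the vmkfstools step, because it isn’t registered in the inventory):

Get-HardDisk -VM (Get-VM -Name "servername") -Name "Hard disk 2" |
    Set-HardDisk -CapacityGB 600 -Confirm:$false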

 

EDIT:

There is another way you can resize a vSphere Replicated VM…that most people do not seem to be aware of:

Resize Virtual Machine Disk Files of a Replication that Does Not Use Replication Seeds

Basically you:

  • Run a planned migration
  • Stop replication
  • Resize the original source VM and then resize the recovered VM (which is powered off)
  • Remove the recovered VM at the recovery site from the vCenter inventory (remove from inventory NOT delete from disk!)
  • Reconfigure replication and now use the recovery site VM as a seed

 

 

 

When I was trying to do a rescan using the SRA in SRM I came across this error:

SRA command ‘discoverDevices’ failed. Error running command on Enterprise Manager [Command: discoverDevices] [Error: StorageCenterError – Error getting ” Volume ” from SC [SN: 23,424 ] [Name: Storage Center 2 – Aberford Production ] (” IO Exception communicating with the Storage Center: No buffer space available (maximum connections reached?): connect “)]

I checked the Enterprise Manager logs for error details:

No buffer space available (maximum connections reached?)

I had a look at netstat:

sanent_hotfix1

As you can see from the screenshot, the maximum number of sockets had been reached.
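
If you want to check the same thing yourself, these are roughly the commands I’d run on the Data Collector (PowerShell; counts and ranges will obviously differ on your box):

# count TCP connections sat in TIME_WAIT
(netstat -ano | Select-String "TIME_WAIT").Count
# show the dynamic/ephemeral port range the OS can hand out
netsh int ipv4 show dynamicport tcp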

After doing some digging, it appears it’s a known issue on 2008 R2 and there is a hotfix:

https://support.microsoft.com/en-us/kb/2577795

http://supportkb.riverbed.com/support/index/index?page=content&id=S23580&cat=VNE_SERVER&actp=LIST


I downloaded the hotfix and applied it to both my primary and secondary Data Collectors, as they both run 2008 R2. As always I took a snapshot first and then applied the hotfix.

The issue has not come back since!

Configuring the alarms isn’t all that easy, and they’re not documented that well.

In the vExpert Slack I was chatting to Tim (@railroadmanuk) as he was trying to configure some alarms, and this is something I have been meaning to dig into more myself.

So he started off and set me in the right direction, as you have to do it from the vCenter Server object: vCenter Server > Manage > Alarm Definitions.

Then you create a new alarm.

I needed to create an alarm for when a new VM is put onto a replicated datastore, and an alarm for when a currently protected VM has been reconfigured in a way that would impact its protection status (e.g. a new VMDK on a non-replicated datastore, or a new vNIC on an un-mapped port group).

I had to fiddle with the alarms a bit to get the right mix of what I wanted, as I could get the alarm to show an alert that a new VM had been discovered, but I couldn’t get it to clear manually. I eventually figured it out, as shown:

new_srm_alarm1

For “VM Discovered” there is no direct opposite really, so I had to try a few other options until I found what worked well enough. Basically the testing involved moving a test VM onto a replicated datastore, then moving it off that datastore, and also protecting the VM, to confirm the alarm cleared in both cases.

The unprotected alarm was much simpler to work out:

new_srm_alarm2

Once they were done and tested, I set them to email their alerts to the group mailbox for someone to pick up. The emails were set to repeat every 720 minutes, which is every 12 hours.
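
The email action and the repeat interval can also be bolted on with PowerCLI rather than the GUI. A hedged sketch; the alarm name and mailbox are placeholders for whatever you called yours:

$alarm = Get-AlarmDefinition -Name "New VM on replicated datastore"
$alarm | New-AlarmAction -Email -To "group-mailbox@example.com"
# repeat the actions every 720 minutes (12 hours)
$alarm | Set-AlarmDefinition -ActionRepeatMinutes 720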

 

I originally wrote this for http://vmusketeers.com/

When you use SRM (in my case SRM version 5.8.1) in combination with VR to protect your VMs, SRM will own these VMs. You will not be able to recover these protected VMs using vSphere Replication on its own, as long as they are SRM protected. This became an issue for us, simply because we had an issue where the SRM service would fail to start at the DR site. Working with VMware support we got it working again, but neither VMware Support nor I really knew what the root cause was.

A little SRA and SRM history

We are using SRM in combination with SRAs and had a Recovery Plan that was stuck in partial re-protect mode, and no amount of fiddling would get it sorted. To bypass this SRM protection, VMware support provided me with a tool that interrogates the SRM database and removes all references to a specific Recovery Plan/Protection Group. This tool worked well and removed all references, but as soon as I used it, the SRM service at the DR site would not start, and the SRM logs were of no help in figuring out what happened. We reverted the SQL DB from backup at both sites and re-did it…again and again and again. Every time we had the exact same result, at which point VMware said all that was left was to do a fresh install from scratch (urghhh).

Oddly enough when I tried it the next day it did work! The services started and no matter what I did, no matter how many times I tried to replicate the issue, I simply couldn’t.

VMware still cannot explain why this happened and neither can I. As great as it is that it is working, not knowing the root cause, troubles me quite a bit!

A VR solution

So to stop this issue from happening again, I moved away from array-based replication to VR where possible, because I didn’t want to rely on the SRA (everyone I speak to, regardless of vendor, has issues with their SRA). These days it seems to be recommended that if you can use VR you should, as it just works.

So by using VR we knew that we could use it with SRM, and also use it without SRM, recovering VMs one by one using VR on its own if needed. Worst case, we could add the replicated VMs to the inventory ourselves.

Anyway, I noticed that when clicking on a VM to recover it on its own, one that was also protected by SRM…I simply couldn’t.

This puzzled me for a while, as I have always assumed that VR worked with SRM but could always be run independently regardless. I spoke to people using 6.0 and they said the option to do it manually was there, but for some reason I just couldn’t, no matter what I did.

srm-and-vr-screenshot-1

After much digging and asking around, no one really knew and it was puzzling. I knew that if I removed the SRM protection I could recover the VM using VR manually just fine.

So after digging around in the SRM 5.8 admin guide, on page 114 I found this:

http://pubs.vmware.com/srm-58/topic/com.vmware.ICbase/PDF/srm-admin-5-8.pdf page 114

Change vSphere Replication Settings

You can adjust global settings to change how Site Recovery Manager interacts with vSphere Replication. 

  1.  In the vSphere Web Client, click Site Recovery > Sites, and select a site.
  2. On the Manage tab, click Advanced Settings.
  3. Click vSphere Replication.
  4. Click Edit to modify the vSphere Replication settings.

Option: Allow vSphere Replication to recover virtual machines that are included in Site Recovery Manager recovery plans independently of Site Recovery Manager. The default value is false.

Description: If you configure vSphere Replication on a virtual machine and include the virtual machine in a Site Recovery Manager recovery plan, you cannot recover the virtual machine by using vSphere Replication independently of Site Recovery Manager. To allow vSphere Replication to recover virtual machines independently of Site Recovery Manager, select the allowOtherSolutionTagInRecovery check box.

Now see this seems to be exactly what I was after….or so you would think heh.

After changing the setting to “true” at both sites for SRM, the issue still remained. I restarted the SRM/VC/Web service at both sites and I was still unable to manually recover VR VMs that were also protected by SRM.

Next I deleted the Protection Group for my test VMs and then they became tagged by VR again and I could recover manually, but adding them back to a Protection Group tagged them as SRM and I was unable to recover them manually again.

I already had a VMware Support Case open, from when we were dealing with the SRM service dying at our DR site. The engineer seemed to think there was some kind of glitch as everyone in support assumed that you could always recover VR replicated VMs manually in the GUI regardless of whether they were part of an SRM Protection Group.

I did some more digging and spoke to @Mike_Laverick, who wrote the book on SRM back at 5.0 but hasn’t worked with SRM in a while. He put me in touch with GS Khalsa (@gurusimran), who started to look into it and brought it up with the engineering team at VMware.

I decided to have a fiddle with the SRM HOL, which is currently at 6.1. On initial inspection it appeared it would let you fail over VR VMs manually even if tagged by SRM.

 

srm_vr_unabletodo_581png

But when you try, the first screen stops you, saying they are tagged by SRM.

srm_vr_unabletodo_581_pic2

The advanced setting made no difference either way! So, exactly the same thing as I was experiencing in 5.8.1. From what I can tell, everyone assumes that since they can see the red Play button in VR, it’ll allow them to recover the VM regardless of its SRM status!

After discussing it with GS and with him speaking to the engineering team, it actually looks like the documentation is incorrect, and as such he has opened an internal case to get the description for the advanced setting changed.

From what he has been told, it is more to do with how SRM interacts with other third-party solutions; something along the lines of “To allow SRM to recover VMs whose replications are managed by other solutions, check this box.”

 

I originally wrote this for http://vmusketeers.com/

So you can use vRDMs with vSphere Replication:

vSphere Replication and Virtual Raw Device Mappings

There aren’t many use cases for it and it isn’t documented very well.

So I have just been pondering something: vSphere Replication supports VMs that have vRDMs, but not pRDMs. ESXi 5.5 now supports vRDMs up to 62TB. What it does is bring the vRDM up as a VMDK file at the DR site.

I am throwing this out there as a possible option for replicating our VMs that have RDMs attached, since we have a few VMs with RDMs that need to be replicated too.

“If you wish to maintain the use of a virtual RDM at the target location, it is possible to create a virtual RDM at the target location using the same size LUN, unregister (not delete from disk) the virtual machine to which the virtual RDM is attached from the vCenter Server inventory, and then use that virtual RDM as a seed for replication. However, this process is a bit more cumbersome – especially compared to what we just discussed above.”

So it’s possible even when using vRDMs at the target site as a seed, so in theory we could ditch array-based replication altogether.

I always thought that if you used vRDMs it just brought them up as a VMDK at the DR site and that was that…clearly I was wrong!

We did ponder moving to large VMDKs, as over 2TB is now supported, but:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058287

You cannot hot-extend a virtual disk if the capacity after extending the disk is equal to or greater than 2 TB. Only offline extension of GPT-partitioned disks beyond 2 TB is possible.

We need to be able to extend these large disks on the fly, without downtime, so that rules out large VMDKs.

Virtual Compatibility mode:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007021

To safely expand the RDM:

1. Expand the RDM LUN from the SAN side.
2. Perform a rescan on the ESX host and verify the new LUN size is observed.

Recreate the RDM mapping to update the mapped disk size using one of these methods:

3a. Utilise Storage vMotion to migrate the Virtual RDM disk’s pointer file (vSphere 4.0 and later).

Or

3b. Remove the RDM file from the Virtual Machine and delete from disk. Power off the virtual machine, note the scsiX:Y position of the RDM in VM Settings. Navigate to VM Settings > Add > Hard Disk > RDM, select the scsiX:Y position that the RDM was using before and then power on the virtual machine.

Perform a re-scan from the guest operating system.

I’ve tried all the methods; a rescan and Storage vMotion is the easiest and doesn’t involve any downtime.

You can then extend it in the OS and Storage vMotion it back to its original datastore.
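
Roughly, in PowerCLI terms, the rescan-and-sVMotion approach looks like this (cluster, VM and datastore names are placeholders; this is a sketch, not a script I ran verbatim):

# rescan the HBAs on every host in the cluster so the bigger LUN is seen
Get-VMHost -Location "ProdCluster" | Get-VMHostStorage -RescanAllHba | Out-Null
# Storage vMotion the VM (and with it the vRDM pointer file) to another datastore
Move-VM -VM (Get-VM -Name "RDM-VM") -Datastore (Get-Datastore -Name "TempDS")
# extend the volume inside the guest OS, then move it back
Move-VM -VM (Get-VM -Name "RDM-VM") -Datastore (Get-Datastore -Name "OriginalDS")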

I attached a vRDM to my test VM; this stopped VR replication, as it detected a new disk that needed to be configured.

  • I dumped some random text files on the RDM in Microsoft Windows
  • I then replicated the RDM volume via the SAN
  • Killed the replication and attached that volume to the DR cluster
  • I added the seed to the inventory and attached the replicated vRDM to it using the same SCSI ID
  • I then reconfigured replication and it picked up the seed

It then did a full sync comparing the data.

Even if there is no data to replicate, a full sync can take an age, as it does a checksum compare for integrity. For around 500GB it takes on average 2 hours per VMDK.

So for very large vRDMs, a sync even when using seeds could run into hours and hours; extrapolating roughly from my numbers (about 250GB an hour), a 10TB vRDM would be looking at around 40 hours. Also bear in mind that when you resize the disks it will do a full sync again, because for any VR-replicated VM, to expand the disk you have to pause replication, resize the disk at both ends and then restart replication. So this could leave you without protection for a long time.

This for me was a deal breaker really, but I kept on testing.

I then extended the vRDM at the Protected Site by 10GB, did the rescan and Storage vMotioned the VM, and the new size came up on the host.

I paused replication and then did the resize at the DR site. I did it by manually adding the seed to the inventory, removing the vRDM from the VM at the DR site and adding it back in using the same SCSI ID (option 3b from the list earlier). But when configuring it back up for replication I got a UUID mismatch, so I edited the VMDK descriptor file to match the source and then VR would re-sync.

KB on modifying UUIDs for seeds:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2041657

There is no documentation anywhere about how to deal with resizing a vRDM that is protected by vSphere Replication, but I am assuming the way I did it is correct, or as close to correct as it gets, haha.

With pRDMs it will give you a warning and let you configure the rest of the VM, but on completion it will say “A general system error occurred: No such device”.

I was hoping it would still replicate the rest of the VM and ignore the pRDM #sadtimes ha

Now this was more experimentation on my end, so I could evaluate all the options and see which worked best for me, as I have encountered my fair share of issues using the Dell Compellent SRA, and vSphere Replication on the whole has been pretty flawless.

Run this script:

https://www.vmadmin.co.uk/resources/48-vspherepowercli/251-powerclinetworkoutputcsv

That gets you all the MAC addresses for the hosts in the cluster.
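
If you don’t fancy the full script, a cut-down PowerCLI one-liner along the same lines (cluster name and output path are placeholders):

Get-Cluster -Name "ProdCluster" | Get-VMHost | Get-VMHostNetworkAdapter |
    Select-Object VMHost, Name, Mac |
    Export-Csv -Path C:\temp\host-macs.csv -NoTypeInformation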

Now follow what is in this KB:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10051

vmkfstools -D /vmfs/volumes/UUID/VMDIR/LOCKEDFILE.xxx

Run that against the vmdk file you can’t delete and you’ll get an output like:

Lock [type 10c00001 offset 233148416 v 1069, hb offset 4075520

gen 57, mode 1, owner 570d1952-3933cca0-906d-bc305bf57cf4 mtime 8947925

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 552, 50>, gen 1050, links 1, type reg, flags 0, uid 0, gid 0, mode 600

Now match the owner value against the list you have in that CSV output you got earlier; the last section of the owner field (bc305bf57cf4 above, i.e. MAC bc:30:5b:f5:7c:f4) is the MAC address of the host holding the lock.
That will give you the host that maintains the lock. Log onto that host and see if there is a VM in an unknown state; if there is……delete it.
If not, browse the datastore from this host and delete the locked file, and you should be able to do so just fine.