While I was updating a host, I came across this error on boot-up:
The iDRAC reported that all the memory DIMMs were fine and working, which I thought was odd.
The normal procedure for testing a DIMM is to move it into a different slot and see whether the error follows the DIMM. If it does, you have a faulty DIMM; if it doesn't, you most likely have a faulty motherboard.
So I called Dell ProSupport and they confirmed this was the way to go, with a few extra tests. Initially they wanted me to run a full memory test of all the DIMMs, but this would have taken a very long time and left the host out of action for that whole time.
So they said:
- Move all the DIMMs from bank B to bank A and the DIMMs in bank A to bank B, and keep an eye on the supposedly faulty DIMM from B1.
- Move the faulty DIMM, which is now in A1, to another slot.
By doing this we were testing whether the problem was actually the DIMM, the CPU, or the motherboard.
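The reasoning behind the swap test can be sketched in a few lines of Python. This is only an illustration of the logic, not anything Dell provided: the `Boot` record, the `diagnose` function, and the slot name `A3` are all made up for the example, and it assumes each boot-time error names a single slot, which real errors (like a CPU error) may not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Boot:
    """One boot attempt during the swap test (hypothetical record)."""
    suspect_slot: str          # slot holding the suspect DIMM on this boot
    error_slot: Optional[str]  # slot named by the boot error, None if no error


def diagnose(boots: List[Boot]) -> str:
    """Return 'dimm', 'board_or_cpu', or 'inconclusive'."""
    errors = [b for b in boots if b.error_slot is not None]
    if not errors:
        return "inconclusive"

    # The error followed the DIMM if every error named the slot the suspect
    # DIMM was sitting in, across at least two different slots.
    followed_dimm = all(b.error_slot == b.suspect_slot for b in errors)
    if followed_dimm and len({b.suspect_slot for b in errors}) > 1:
        return "dimm"

    # The error stayed put if it always named the same slot even though the
    # suspect DIMM was moved elsewhere.
    stayed_in_slot = len({b.error_slot for b in errors}) == 1
    if stayed_in_slot and len({b.suspect_slot for b in boots}) > 1:
        return "board_or_cpu"

    return "inconclusive"


# Suspect DIMM started in B1, moved to A1 in the bank swap, then to A3.
observed = [Boot("B1", "B1"), Boot("A1", "A1"), Boot("A3", "A3")]
print(diagnose(observed))  # dimm
```

A single boot is never enough to decide either way, which is why the DIMM has to be moved at least once before drawing a conclusion.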
So I did the initial swap round, and the server booted up with a CPU error, which could easily have been caused by the faulty DIMM. I carried on with the other test, and after moving the faulty DIMM around some more, the error followed the DIMM.
So I reported my findings to Dell and they sent out a replacement DIMM. I swapped it in and everything is now working FINE!
The reason the iDRAC still showed the faulty DIMM as fine is that the MEMBIST test acts as an early warning of a potential failure.