
VMware NVMe Memory Tiering Questions Answered (Memory over NVMe)

5 Posts
2 Users
0 Reactions
100 Views
Brandon Lee
(@brandon-lee)
Posts: 340
Member Admin
Topic starter
 

I have had several questions about NVMe memory tiering since writing several posts about it and featuring it in recent YouTube videos. I wanted to address some of those here, as I think it will be helpful to share my findings after using NVMe memory tiering for a few weeks now. Hopefully this will help you decide whether you want to play around with it or not.

Can I use this in versions prior to vSphere 8.0 Update 3?

No, you will need to upgrade your hosts to ESXi 8.0 Update 3 to use the feature.
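If you are not sure which build a host is on, you can check from the ESXi shell. This is just the standard version command, nothing tiering-specific:

    # Shows product, version, build number, and update level
    esxcli system version get

You want to see Update 3 (or later) in the output before trying to enable the feature.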

Do you have to have vCenter Server to use memory tiering?

No, you can turn this on with standalone ESXi hosts; it doesn't require vCenter Server at all.
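For reference, on a standalone host this is enabled from the ESXi shell. The commands below are roughly what the 8.0 Update 3 tech preview uses; double-check the tech preview documentation for the exact syntax on your build, and note that a reboot is required afterwards:

    # Turn on the memory tiering kernel setting (tech preview)
    esxcli system settings kernel set -s MemoryTiering -v TRUE

    # Reboot the host so the setting takes effect
    reboot

After the reboot you can assign an NVMe device to the tier, which I touch on further below.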

Isn't this just paging like we have seen in Linux for years?

No, evidently not. VMware uses a lot of logic under the hood to place memory pages more intelligently than simple paging, which reduces page faults among other benefits. It decides which pages need to stay in the fastest DRAM and which ones can live on the NVMe device.

Does it require a special license?

It is currently a tech preview, so there is no additional license needed to play around with it in the home lab if you have a VMUG subscription, etc. I suspect it will eventually be tied to the new VVF or VCF licensing structure.

Does NVMe memory tiering require a whole NVMe drive?

Yes, it does. You tell NVMe tiering which drive you want to dedicate to it, and it claims the whole device. You can't use just part of the drive with a partition; it creates its own partition structure on the drive for this purpose.
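For reference, assigning a drive in the tech preview looks roughly like the commands below. The device path is a placeholder for your own NVMe device ID, and the percentage value is only an example, so check the tech preview documentation for the exact syntax and sensible values for your hardware:

    # Find the NVMe device identifier
    esxcli storage core device list

    # Dedicate the whole device to memory tiering (path below is a placeholder)
    esxcli system tierdevice create -d /vmfs/devices/disks/<your-nvme-device-id>

    # Optionally adjust how much NVMe-backed memory is exposed, as a percentage of DRAM
    esxcli system settings advanced set -o /Mem/TierNvmePct -i 100

A reboot is needed after these changes before the additional tiered memory shows up on the host.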

Can you have more than one NVMe tiering drive?

Yes. I inadvertently set up two drives without realizing it: I copied and pasted the wrong drive ID when I already had a drive allocated, and both of them ended up marked for NVMe memory tiering. I am not exactly sure how it is used when more than one is configured; possibly for redundancy.
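If you want to double-check what is actually configured, there is a list command in the same esxcli namespace (at least on the build I am running; verify against the tech preview docs for yours):

    # Show devices currently claimed for NVMe memory tiering
    esxcli system tierdevice list

That should show every device that has been claimed for tiering, which is handy if you suspect you have assigned one by mistake like I did.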

Are there limitations when using the NVMe memory tiering feature?

Yes, there are a few I would like to mention below:

  • You can't use storage migration - When attempting to migrate VMs from one datastore to another, I noticed the storage migration fails. When I disabled NVMe tiering on the host, the storage migration was successful. I believe this limitation is tied to snapshots, which I mention next
  • You can't use "with memory" snapshots - You can only capture snapshots without the memory option. This is currently a limitation of NVMe memory tiering, and I suspect it is the reason storage migrations fail (see the snapshot example after this list)
  • You can't do nested virtualization - You can't set up a VM with nested virtualization enabled on a memory tiering-enabled host
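On the snapshot point, you can still snapshot VMs on a tiering-enabled host as long as you leave the memory option off. From the ESXi shell, a memory-less snapshot looks something like this (the VM ID, name, and description below are just placeholders):

    # Find the VM ID first
    vim-cmd vmsvc/getallvms

    # snapshot.create arguments: <vmid> <name> <description> <includeMemory> <quiesced>
    # includeMemory has to stay 0 on a host with NVMe tiering enabled
    vim-cmd vmsvc/snapshot.create 12 "pre-change" "snapshot without memory" 0 0

In the vSphere client it is the same idea: just leave the memory option unchecked when taking the snapshot.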

I am curious if any of you have tried/are trying out NVMe memory tiering. Have you discovered any limitations to note outside of the list above? @jnew1213 I know you had mentioned you were trying this. Have you run into any "gotchas" so far?

 
Posted : 18/09/2024 7:44 am
JNew1213
(@jnew1213)
Posts: 16
Eminent Member
 

I have memory tiering enabled on a second MS-01 I bought just for this purpose.

I also discovered that svMotions fail without much explanation. Storage migration for a powered-off VM works fine.

There may or may not be an issue with a Docker instance I have running on an Ubuntu Server VM. Within Docker, I have an installation of Dashy, which is the dashboard I use for most things here. Dashy is set to display when I open a new tab in my browser.

After a couple of days I've noticed, repeatedly, the Ubuntu VM OR Docker OR Dashy becoming less and less responsive, sometimes to the point where it doesn't load and I have to manually refresh the screen to get it to do so.

I don't know in which component this is happening or what's causing it. As of now, I have moved the Ubuntu VM (with Docker and Dashy) back to the MS-01 that doesn't have memory tiering enabled, and the VM has been fine for several days. I will eventually move it back to the memory tiered machine to see if the issue recurs.

Yesterday, I powered on an infrequently used secondary vCenter appliance on the memory tiered machine for patching. I will watch how this fairly large and complex VM behaves on this machine. I already have one vCenter running on the machine, and it seems fine.

I think there's a bit of development work still to go on the memory tiering/M.2 thing. One feature would be tiering of large memory pages. Another would be partitioning just a portion of the M.2 for tiering and leaving the rest of it either for the ESXi installation or for use as a datastore. Additional development might allow vMotion/svMotion of certain VMs to/from memory tiered machines that we can't currently migrate with the preview, including VMs that have memory reservations.

I am not sure that memory tiering in ESXi is the "game changer" for home labs that I have seen it called. But it might turn out to be a handy thing to have on a non-critical virtual machine host that is typically short on memory but still has processing power to spare.

 
Posted : 18/09/2024 8:12 am
Brandon Lee
(@brandon-lee)
Posts: 340
Member Admin
Topic starter
 

@jnew1213 Great observations! And glad to know someone else is making use of it. I definitely see A LOT of potential with this technology, but as mentioned, there are still quite a few rough edges.

It will be great to see how it develops, but there are definitely features that still need to be added or "turned on" with this technology. It makes me wonder if VMware already has the ability to do svMotions, etc., and it is just not enabled in this tech preview. I also like the idea of being able to use a "partial" device partition for tiering so you can take advantage of the rest of the drive as well, but I wonder if they will ever do that, since best practice for caching devices is generally to dedicate them to that task.

I did have a weird failure on my MS-01 with vCenter the other day (my main home lab vCenter). I woke up and monitoring was dinging me about all VMs being down; it was actually just vCenter, but the console showed unrecoverable CPU errors. A quick reboot fixed the issue. I am not sure if this is related to NVMe tiering or if it is a byproduct of the hybrid CPU architecture.

Let me know if you see anything similar to that as well.

 
Posted : 18/09/2024 8:21 am
JNew1213
(@jnew1213)
Posts: 16
Eminent Member
 

I had a weird issue yesterday that I've been reminded of while reading your last post.

A network cable was severed (by a rabbit) yesterday morning and that brought down a number of things including my Plex server. Replacing the cable restored everything, or so I thought.

A friend emailed me late yesterday that Plex seemed to be down. I checked and, sure enough, the VM had hung on restarting at a point before the operating system loaded. Powering the VM off and on again quickly fixed the problem.

I've never seen a VM hang where this one did before, at the "looking for network boot" phase.

This particular VM, Windows 11, is on the MS-01 with memory tiering in effect, and has an Intel ARC 310 GPU passed through to it.

As I said, I haven't seen this kind of hang before, but I'll be keeping my eyes open for it happening again going forward.

Ah! One other thing that needs to be built into memory tiering: M.2 lifecycle management. Currently we don't know how much life any M.2 used by ESXi has left. (iDRAC, etc. has an estimate on properly initialized datacenter class devices.) It's probably less of an issue when used as a datastore, but when used as "memory" an M.2 that spontaneously fails can possibly corrupt a lot of stuff!
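In the meantime, one way to get at least a rough idea of device health is to pull the SMART data from the ESXi shell. This is the generic storage SMART command rather than anything tiering-specific, and what it reports varies a lot by drive, so take the output with a grain of salt (the device identifier below is a placeholder):

    # Find the device identifier
    esxcli storage core device list

    # Dump SMART/health attributes for that device
    esxcli storage core device smart get -d <your-nvme-device-id>

On consumer M.2 drives the wear-related attributes can be missing or inconsistently reported, which is exactly the gap I am talking about.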

 
Posted : 18/09/2024 8:41 am
Brandon Lee
(@brandon-lee)
Posts: 340
Member Admin
Topic starter
 

@jnew1213 Wow, that is an interesting story you will have to tell me at some point. I have had a lot of things happen to network cables, but never a rabbit! Definitely keep us posted if you see any more VM hanging behavior. I do think that, this being a technical preview (as we knew it would be), things are not quite as stable with tiering turned on.

Also, that is another great point about lifecycle management for M.2 devices, especially ones used for tiering. They become super critical at that point, so a failure needs to be handled carefully. I am sure these are things their engineering team is working through. It will be cool to see how this feature evolves.

I do know, as I mentioned in my first post, that you can mark multiple devices as tier devices. I haven't found any specific guidance or details on this yet, but I wonder if both devices can possibly be used for redundancy? More questions, I guess!

 
Posted : 18/09/2024 9:59 pm