Five Pitfalls In PCIe-Based NVMe Controller Verification

Knowing where trouble spots lie is vital in creating a verification plan to avoid them.


Non-Volatile Memory Express (NVMe) is gaining rapidly in mindshare among consumers and vendors. Some industry analysts are forecasting that PCIe-based NVMe will become the dominant storage interface over the next few years. With its high-performance and low-latency characteristics, and its availability for virtually all platforms, NVMe is a game changer. For the first time, storage devices and storage subsystems have a fundamentally different way to operate with host computers, unlike any previous storage protocol.

NVMe is an optimized, high-performance scalable host controller interface designed to address the needs of enterprise and client systems that utilize PCI Express-based solid-state storage. Designed to move beyond the dark ages of hard disk drive technology, NVMe is built from the ground up for non-volatile memory (NVM) technologies. NVMe is designed to provide efficient access to storage devices built with non-volatile memory — from today’s NAND flash technology to future, higher performing, persistent memory technologies.

With the rise of NVMe, verification engineers must be aware of the common pitfalls experienced while verifying PCIe-based NVMe controllers. By knowing where these pitfalls are most likely to occur, engineers can build a verification plan checklist that avoids them with confidence.

Common NVMe verification pitfalls

1. Not testing register properties
At first glance, controller register testing may seem obvious because registers form basic functionality, yet thorough register testing is often neglected. A closer look shows that a simple test alone rarely detects subtle problems.

Some of the areas to focus on while writing register tests include:

  • Default values as per NVMe specification
  • Hardware initialized values (often defined by design specification)
  • Attributes like read-only, write-1-to-clear, etc.
  • Most importantly, reset values and different resets associated with each register
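The attribute checks listed above can be sketched as a small reference model that a test compares against the design. This is a minimal sketch, not a real NVMe register map: the `Register` class, its mask fields, and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Register:
    """Hypothetical reference model of a 32-bit register (illustration only)."""
    reset_value: int
    ro_mask: int = 0      # bits that ignore writes (read-only)
    rw1c_mask: int = 0    # bits that are cleared by writing 1
    value: int = 0

    def __post_init__(self):
        # The register comes out of reset holding its reset value.
        self.value = self.reset_value

    def write(self, data: int) -> None:
        rw_mask = ~(self.ro_mask | self.rw1c_mask) & 0xFFFFFFFF
        # Plain read/write bits take the written value; other bits are kept.
        new = (self.value & ~rw_mask) | (data & rw_mask)
        # Write-1-to-clear bits are cleared wherever a 1 is written.
        new &= ~(data & self.rw1c_mask)
        self.value = new & 0xFFFFFFFF

    def reset(self) -> None:
        self.value = self.reset_value


# Example: bits 7:0 read-only, bit 8 write-1-to-clear, the rest read/write.
reg = Register(reset_value=0x00000103, ro_mask=0xFF, rw1c_mask=0x100)
reg.write(0xFFFFFFFF)
print(hex(reg.value))
```

A test can write walking patterns through such a model and through the design, then compare read-back values after each write and after each type of reset.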

2. Failing to explore queue related non-trivial scenarios
As per NVMe protocol basics, queues are circular: an entry is added to the queue at the current tail location and popped from the current head location. The tail pointer is incremented by one for each entry pushed into the queue, and the head pointer is incremented by one for each entry popped from the queue. When a pointer moves past the last location in its queue, it rolls back to the first location in that queue; this is the wrapping condition.
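The pointer update can be expressed in a couple of lines. The following is a minimal sketch; the function name `advance` and its arguments are hypothetical and not taken from any verification library.

```python
def advance(pointer: int, queue_size: int) -> int:
    """Advance a head or tail pointer by one entry, wrapping to slot 0."""
    if queue_size <= 0:
        raise ValueError("queue_size must be positive")
    # The modulo folds the pointer back to the first slot after the last one.
    return (pointer + 1) % queue_size


# Walk a tail pointer around a 4-entry queue to exercise the wrap.
tail = 0
for _ in range(5):
    tail = advance(tail, 4)
print(tail)
```

A wrap test should push enough entries for both pointers to pass the last slot at least once, ideally several times and at different relative positions.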

Since NVMe is based on a paired Submission and Completion Queue mechanism, commands are placed by host software into a Submission Queue and completions are placed into the associated Completion Queue by the controller. This queue pairing is an important aspect not to be left unexplored.

One of the gray areas where issues are seen frequently is many-to-one queue mapping, where multiple Submission Queues (SQs) may use the same Completion Queue (CQ). In some designs with such relationships, controllers are prone to reporting incorrect Submission Queue ID and Command ID fields in the posted completion queue entry.
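One way to catch wrong Submission Queue ID or Command ID values is a scoreboard keyed on the (SQ ID, Command ID) pair. The sketch below is a simplified illustration under that assumption; the class `CompletionScoreboard` and its methods are hypothetical names, and a real environment would also track status fields and phase tags.

```python
class CompletionScoreboard:
    """Hypothetical scoreboard for many-to-one SQ-to-CQ mapping checks."""

    def __init__(self):
        # Maps (sqid, cid) of each outstanding command to the CQ it must use.
        self._outstanding = {}

    def submit(self, sqid: int, cid: int, cqid: int) -> None:
        key = (sqid, cid)
        if key in self._outstanding:
            raise ValueError("command already outstanding on this SQ")
        self._outstanding[key] = cqid

    def check_completion(self, cqid: int, sqid: int, cid: int) -> bool:
        # True only if the entry names an outstanding command on the right CQ.
        return self._outstanding.pop((sqid, cid), None) == cqid


sb = CompletionScoreboard()
sb.submit(sqid=1, cid=7, cqid=1)   # two SQs share CQ 1 and reuse Command ID 7
sb.submit(sqid=2, cid=7, cqid=1)
print(sb.check_completion(cqid=1, sqid=2, cid=7))
```

Reusing the same Command ID on different SQs that share a CQ, as in the example, is the case most likely to expose a controller that mixes up the two fields.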

The next point of interest is testing minimum and maximum depth conditions. Although the theoretical maximum queue depth is 64K entries, the depth is limited by the maximum the controller supports, as reported in the CAP.MQES register field. It is important to remember that the total number of entries in a queue when full is one less than the queue size.
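The empty and full conditions follow directly from the head and tail pointers. The helpers below are a minimal sketch with hypothetical names; the only specification detail relied on is that CAP.MQES is a 0's based value.

```python
def queue_size_from_mqes(mqes: int) -> int:
    """CAP.MQES is a 0's based value, so the largest queue has MQES + 1 slots."""
    return mqes + 1


def is_empty(head: int, tail: int) -> bool:
    return head == tail


def is_full(head: int, tail: int, queue_size: int) -> bool:
    # Full when the tail sits one slot behind the head, so one slot stays unused.
    return (tail + 1) % queue_size == head


def entries_in_use(head: int, tail: int, queue_size: int) -> int:
    return (tail - head) % queue_size


size = 4
print(is_full(0, 3, size), entries_in_use(0, 3, size))
```

With a 4-slot queue, the full condition is reached at three entries, which is the "one less than the queue size" rule stated above.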


Figure 1: Controller queues.

Another important scenario is when I/O queues are non-contiguous. When a queue is configured as non-contiguous, each PRP Entry in the PRP List shall have an offset of 0h. If there is a PRP Entry with a non-zero offset, then the controller should return an error of PRP Offset Invalid. Also, creating an I/O Submission Queue that utilizes a PRP List is only valid if the controller supports non-contiguous queues, as indicated by the CAP.CQR field (cleared to 0 when contiguous queues are not required). Scenarios related to queue rollover situations cannot be overlooked and must be included in the verification plan.
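A test that injects bad PRP Lists needs a reference check for the offset rule. The sketch below assumes a 4 KiB memory page size by default; the function names and the example addresses are hypothetical.

```python
def prp_offset(entry: int, page_size: int = 4096) -> int:
    """Return the byte offset of a PRP entry within its memory page."""
    return entry & (page_size - 1)


def first_bad_prp_entry(prp_list, page_size: int = 4096):
    """Return the index of the first entry with a non-zero offset, or None."""
    for index, entry in enumerate(prp_list):
        if prp_offset(entry, page_size) != 0:
            return index
    return None


good = [0x10000000, 0x10001000, 0x20004000]
bad = [0x10000000, 0x10001008]           # second entry has offset 8h
print(first_bad_prp_entry(good), first_bad_prp_entry(bad))
```

When the helper flags an entry, the expected controller response to the queue creation command is the PRP Offset Invalid error described above.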

3. Not focusing on controllers attached to PCIe virtual functions
Single Root I/O Virtualization (SR-IOV) is a specification that allows a PCIe device to appear to be multiple separate physical PCIe devices. In addition to physical functions (PFs), it introduces virtual functions (VFs), which can be considered “lightweight” functions that lack configuration resources. VFs implement a subset of traditional configuration space and share some configuration space with the associated PF.

The most common problems in this category relate to BDF calculation, where BDF means bus, device, and function number. Once the VF Enable field is set in SR-IOV, VFs are created and will respond to configuration transactions. They will not, however, be discovered automatically by legacy enumeration software. A new mechanism exists in SR-IOV that allows SR-IOV compliant software to locate VFs. The First VF Offset and VF Stride create a linked list that starts at a PF and can be used to identify the location of all the VFs associated with that PF. During this newer enumeration process, a unique BDF is assigned to each implemented VF along with the PF.

Figure 2 illustrates an NVMe subsystem that supports SR-IOV and has one PF and four VFs. An NVMe controller is associated with each VF, with each controller having a private namespace and access to a namespace shared by all controllers, labeled NS E.


Figure 2: PCIe device supporting Single Root I/O Virtualization.

Another error-prone configuration is when more than 256 VFs are supported per PF. SR-IOV allows a device to implement hundreds of VFs. To do this, an SR-IOV device may request that software allocate more than one bus number (in order to support more than 256 functions).
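The BDF of each VF can be derived from the PF routing ID, First VF Offset, and VF Stride using 16-bit arithmetic. The sketch below assumes the conventional split of a routing ID into an 8-bit bus, 5-bit device, and 3-bit function; the function names and example values are hypothetical.

```python
def vf_routing_id(pf_rid: int, first_vf_offset: int, vf_stride: int,
                  vf_number: int) -> int:
    """Routing ID of VF number vf_number (1-based), using 16-bit arithmetic."""
    if vf_number < 1:
        raise ValueError("VF numbering starts at 1")
    return (pf_rid + first_vf_offset + (vf_number - 1) * vf_stride) & 0xFFFF


def split_bdf(rid: int):
    """Split a 16-bit routing ID into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


pf = 0x0100                      # PF at bus 1, device 0, function 0
first = vf_routing_id(pf, first_vf_offset=1, vf_stride=1, vf_number=1)
last = vf_routing_id(pf, first_vf_offset=1, vf_stride=1, vf_number=256)
print(split_bdf(first), split_bdf(last))
```

In this example the 256th VF lands on the next bus number, which is exactly the case where an incorrect BDF calculation tends to surface.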

Some other points to consider while verifying SR-IOV-based NVMe controllers are:

  • At power on, VFs do not exist and cannot be accessed through configuration
  • Prior to accessing or assigning VFs, they must be configured and enabled through the SR-IOV capability located in associated PFs.
  • The SR-IOV capability structure in the PF PCI Configuration space includes a system page size field that the VMM should set to the size supported by the platform.
  • Memory spaces for the VFs are concatenated in a contiguous memory space located by the VF base address
  • In order to ensure isolation of the individual memory spaces, VFs must align their memory resource to the page protection boundaries provided by the
  • Although the theoretical maximum number of VFs that can be attached to a physical function (PF) is 256 (when the bus number is not changed), the number actually supported is limited by the “InitialVFs” and “NumVFs” fields in the configuration

4. Overlooking interrupt and its mask handling complexity
NVMe allows a controller to be configured to signal interrupts in four modes: pin-based, single-message MSI, multiple-message MSI, and MSI-X. While pin-based and MSI interrupts use NVMe-defined registers for interrupt mask/clear, MSI-X uses a mask table defined as part of the PCIe specification. Another point to consider is that pin-based and single-message MSI use only a single vector, multiple-message MSI can use up to 32 vectors, and MSI-X supports a maximum of 2K interrupt vectors.

The interrupt mode in use is determined by which of the following conditions is met:

  • Pin based interrupts are being used – MSI (MSICAP.MC.MSIE=’0’) and MSI-X are disabled
  • Single MSI is being used – MSI is enabled (MSICAP.MC.MSIE=’1’), MSICAP.MC.MME=0h, and MSI-X is disabled
  • Multiple MSI is being used – Multiple-message MSI is enabled (MSICAP.MC.MSIE=’1’) and (MSICAP.MC.MME=1h) and MSI-X is disabled
  • MSI-X is being used – (multiple-message) MSI is disabled (MSICAP.MC.MSIE=’0’) and (MSICAP.MC.MME=0h) and MSI-X is enabled
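The mode selection above maps naturally onto a small decode function that a testbench can use to predict the expected signaling mechanism. This is a minimal sketch: the function name and the returned strings are hypothetical, and raising an error when MSI and MSI-X are both enabled is a choice made for this sketch, since the list above does not cover that combination.

```python
def interrupt_mode(msie: int, mme: int, msix_enable: int) -> str:
    """Decode the interrupt mode from MSICAP.MC.MSIE, MSICAP.MC.MME and MSI-X enable."""
    if msie and msix_enable:
        # Not covered by the conditions listed above; treated as a setup error here.
        raise ValueError("MSI and MSI-X are both enabled")
    if msix_enable:
        return "msi-x"
    if msie:
        # MME of 0h selects a single message; any other value selects multiple.
        return "single-msi" if mme == 0 else "multi-msi"
    return "pin"


print(interrupt_mode(msie=0, mme=0, msix_enable=0))
print(interrupt_mode(msie=1, mme=1, msix_enable=0))
```

A mode-switching test can iterate over these settings and confirm that the controller signals completions only through the mechanism the decode predicts.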

While verifying interrupt mechanisms, confirm that an interrupt is signaled only when all of the following conditions hold concurrently:

  • There is one or more unacknowledged completion queue entries in a Completion Queue that utilizes this interrupt vector
  • The Completion Queue(s) with unacknowledged completion queue entries has interrupts enabled in the “Create I/O Completion Queue” command
  • For pin and MSI interrupts, the corresponding INTM bit exposed to the host is cleared to 0, indicating that the interrupt is not masked
  • When MSI-X is used, the function mask bit in the MSI-X Message Control register is not set to 1 and the corresponding vector mask in the MSI-X table structure is not set to 1
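The conditions above can be combined into a single predicate that a checker evaluates whenever queue or mask state changes. The sketch below is illustrative only; the function name, the argument names, and the `"msi-x"` mode string are hypothetical.

```python
def interrupt_expected(unacked_cqes: int, cq_interrupts_enabled: bool,
                       mode: str, intm_bit: int = 0,
                       msix_function_mask: int = 0,
                       msix_vector_mask: int = 0) -> bool:
    """Return True when the vector is expected to signal an interrupt."""
    # At least one unacknowledged entry on a CQ that has interrupts enabled.
    if unacked_cqes < 1 or not cq_interrupts_enabled:
        return False
    if mode == "msi-x":
        # Neither the function mask nor the per-vector mask may be set.
        return msix_function_mask == 0 and msix_vector_mask == 0
    # Pin-based and MSI modes: the INTM bit must be cleared to 0.
    return intm_bit == 0


print(interrupt_expected(2, True, "pin", intm_bit=0))
print(interrupt_expected(2, True, "msi-x", msix_vector_mask=1))
```

Comparing this prediction against observed interrupts in both directions helps catch spurious interrupts as well as missing ones.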

Though interrupt processing and mask handling form basic functionality for signaling command completion by controllers, and this feature is tested in every NVMe command process, some specific scenarios are worth considering while creating a test plan.

Case 1: For SR-IOV-enabled devices, the following points should be acknowledged. PFs may implement INTx, but VFs must not implement INTx. Each PF and VF must implement its own unique interrupt capabilities.

Case 2: Interrupt aggregation, also referred to as interrupt coalescing, mitigates host interrupt overhead by reducing the rate at which interrupt requests are generated by a controller. Controllers supporting coalescing may combine multiple interrupt triggers for a particular vector.

5. Misunderstanding metadata concepts
Metadata is contextual information about a particular LBA of data. It is additional data allocated on a per logical block basis. There is no requirement for how the host makes use of the metadata area. One of the most common usages for metadata is to convey end-to-end protection information. The metadata may be transferred by the controller to or from the host in one of two ways. One of the transfer mechanisms shall be selected for each namespace when it is formatted; transferring a portion of metadata with one mechanism and a portion with the other mechanism is not supported.

The first mechanism for transferring the metadata is as a contiguous part of the logical block that it is associated with. The metadata is transferred at the end of the associated logical block, forming an extended logical block. This mechanism is illustrated in Figure 3. In this case, both the logical block data and logical block metadata are pointed to by the PRP1 and PRP2 pointers (or SGL Entry 1 if SGLs are used).


Figure 3: Metadata as extended LBA

Erroneous behavior is likely in this scenario when the controller transfers fewer total bytes than expected. For example, if the LBA data size is 512B and the metadata per block is configured as 16B, the expected total data transfer for one logical block is 512 + 16 = 528B. When PRPs are used, this 528B transfer is indicated only by the number of logical blocks to be written or read (Command DW12 bits 15:0, NLB); the total number of bytes is not explicitly specified anywhere. A faulty controller may therefore send fewer bytes than the expected 528B. Other factors should also be considered when calculating metadata bytes, such as the current LBA format number and metadata support in the selected namespace.
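The expected byte count is simple arithmetic that a scoreboard can compute per command. The sketch below uses a hypothetical function name and relies on the NLB field being a 0's based value, so a field value of 0 means one logical block.

```python
def expected_transfer_bytes(nlb_field: int, lba_data_size: int,
                            metadata_size: int, extended_lba: bool) -> int:
    """Bytes moved through the data pointer for one read or write command."""
    blocks = nlb_field + 1          # NLB in Command DW12 is 0's based
    per_block = lba_data_size
    if extended_lba:
        # Metadata travels at the end of each logical block.
        per_block += metadata_size
    return blocks * per_block


print(expected_transfer_bytes(0, 512, 16, extended_lba=True))
print(expected_transfer_bytes(7, 512, 16, extended_lba=True))
```

The first call reproduces the 528B example from the text; the second covers eight extended logical blocks.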

SGL transfers are another place where mistakes occur in the extended LBA mechanism. With PRPs, the controller internally works out the total number of bytes to be transferred; with SGLs, the total bytes, including data and metadata, should be set in the length field of the SGL descriptors.

The second mechanism for transferring the metadata is as a separate buffer of data. This mechanism is illustrated in Figure 4. In this case, the metadata is pointed to with the Metadata Pointer, while the logical block data is pointed to by the Data Pointer.


Figure 4: Metadata as separate buffer

The essential characteristics to test in the metadata-as-separate-buffer mode are:

  • When PRPs are used for transfer (PSDT is 00b), the metadata is required to be physically contiguous, and MPTR contains the address of a contiguous physical buffer of metadata and shall be Dword aligned
  • When SGLs are used for transfer and PSDT is 01b, the metadata is pointed to by MPTR, which contains the address of a contiguous physical buffer of metadata and is byte aligned
  • When SGLs are used for transfer and PSDT is 10b, the metadata is pointed to by MPTR, which contains the address of an SGL segment containing exactly one SGL Descriptor and shall be Qword aligned. If the SGL segment is a Data Block descriptor, then it describes the entire data transfer.
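The alignment rule for MPTR depends on the PSDT value, which makes it a good candidate for a table-driven check. The sketch below assumes Dword, byte, and Qword alignment for PSDT values 00b, 01b, and 10b respectively; the table and function names are hypothetical.

```python
# Required MPTR alignment in bytes for each PSDT encoding.
MPTR_ALIGNMENT = {
    0b00: 4,   # PRPs: contiguous metadata buffer, Dword aligned
    0b01: 1,   # SGLs: contiguous metadata buffer, byte aligned
    0b10: 8,   # SGLs: MPTR points to an SGL segment, Qword aligned
}


def mptr_is_aligned(psdt: int, mptr: int) -> bool:
    """Check the Metadata Pointer alignment for the given PSDT value."""
    if psdt not in MPTR_ALIGNMENT:
        raise ValueError("unsupported PSDT value")
    return mptr % MPTR_ALIGNMENT[psdt] == 0


print(mptr_is_aligned(0b00, 0x1004), mptr_is_aligned(0b10, 0x1004))
```

Negative tests can deliberately pick an address that passes one PSDT setting and fails another, as in the example where 1004h is Dword aligned but not Qword aligned.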

Considering the number and complexity of features supported by the NVMe specification, it is no surprise that the verification effort is huge. Fortunately, the NVMe Questa Verification IP (QVIP) sequence library makes these problematic scenarios easy to verify and helps accelerate verification cycles. By focusing their verification effort on these five key areas and applying NVMe QVIP, verification engineers will save themselves a lot of trouble.

To learn more about NVMe QVIP and other supported protocols, please visit us at

