Discussion:
[edk2] [RFC 0/4] OvmfPkg: (almost) enable >= 64 GB guests
Laszlo Ersek
2015-06-08 21:46:47 UTC
This is in response to

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15253
http://thread.gmane.org/gmane.comp.emulators.xen.devel/245623

It's an RFC only because the last patch doesn't actually work. Ideas
welcome.

Cc: Maoming <***@huawei.com>
Cc: Huangpeng (Peter) <***@huawei.com>
Cc: Wei Liu <***@citrix.com>

Thanks
Laszlo

Laszlo Ersek (4):
  OvmfPkg: PlatformPei: enable larger permanent PEI RAM
  OvmfPkg: PlatformPei: create the CPU HOB with dynamic memory space
    width
  OvmfPkg: PlatformPei: beautify memory HOB order in QemuInitializeRam()
  OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()

OvmfPkg/PlatformPei/Platform.h | 7 +++
OvmfPkg/PlatformPei/MemDetect.c | 115 +++++++++++++++++++++++++++++++++++-----
OvmfPkg/PlatformPei/Platform.c | 7 ++-
3 files changed, 114 insertions(+), 15 deletions(-)
--
1.8.3.1


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-08 21:46:48 UTC
We'll soon increase the maximum guest-physical RAM size supported by OVMF.
For more RAM, the DXE IPL is going to build more page tables, and for that
it's going to need a bigger chunk from the permanent PEI RAM. Otherwise the
DXE IPL fails with an error like the following:

  DXE IPL Entry
  Loading PEIM at 0x000BFF61000 EntryPoint=0x000BFF61260 DxeCore.efi
  Loading DXE CORE at 0x000BFF61000 EntryPoint=0x000BFF61260
  AllocatePages failed: No 0x40201 Pages is available.
  There is only left 0x3F1F pages memory resource to be allocated.
  BigPageAddress != 0

(The above example belongs to the artificially high, maximal address width
of 52, clamped by the DXE core to 48. The address width of 48 bits
corresponds to 256 TB of RAM, and requires a bit more than 1 GB for paging
structures.)

Cc: Maoming <***@huawei.com>
Cc: Huangpeng (Peter) <***@huawei.com>
Cc: Wei Liu <***@citrix.com>
Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformPei/Platform.h | 7 +++++
OvmfPkg/PlatformPei/MemDetect.c | 61 +++++++++++++++++++++++++++++++++++++++--
OvmfPkg/PlatformPei/Platform.c | 1 +
3 files changed, 66 insertions(+), 3 deletions(-)
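
Note: the 0x40201 page count in the log above can be reproduced by hand.
The sketch below assumes 4-level paging with 2 MB pages (no 1 GB pages),
one PML4 page, one PDPT page per PML4 entry, and one page directory page
per PDPT entry. The function name is made up for illustration; this is not
the DXE IPL's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of 4 KB pages needed to identity-map 2^width bytes.
 * Assumptions: 4-level paging, 2 MB leaf pages, no 1 GB pages.
 */
uint64_t identity_map_pages(unsigned width)
{
    uint64_t pml4_entries;
    uint64_t pdpt_entries;

    assert(width >= 30 && width <= 48);
    if (width <= 39) {
        /* A single PML4 entry covers up to 512 GB (39 bits). */
        pml4_entries = 1;
        pdpt_entries = 1ULL << (width - 30);
    } else {
        pml4_entries = 1ULL << (width - 39);
        pdpt_entries = 512;
    }

    /*
     * Per PML4 entry: one PDPT page plus one page directory page per PDPT
     * entry. The final "+ 1" is the PML4 page itself.
     */
    return (pdpt_entries + 1) * pml4_entries + 1;
}
```

For a width of 48 this yields 0x40201 pages (a little over 1 GB), which
matches the failed AllocatePages request; for a width of 36 it yields only
66 pages.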

diff --git a/OvmfPkg/PlatformPei/Platform.h b/OvmfPkg/PlatformPei/Platform.h
index 31640e9..8b6a976 100644
--- a/OvmfPkg/PlatformPei/Platform.h
+++ b/OvmfPkg/PlatformPei/Platform.h
@@ -59,6 +59,11 @@ AddUntestedMemoryRangeHob (
   EFI_PHYSICAL_ADDRESS MemoryLimit
   );

+VOID
+AddressWidthInitialization (
+  VOID
+  );
+
 EFI_STATUS
 PublishPeiMemory (
   VOID
@@ -100,4 +105,6 @@ extern EFI_BOOT_MODE mBootMode;

 extern BOOLEAN mS3Supported;

+extern UINT8 mPhysMemAddressWidth;
+
 #endif // _PLATFORM_PEI_H_INCLUDED_
diff --git a/OvmfPkg/PlatformPei/MemDetect.c b/OvmfPkg/PlatformPei/MemDetect.c
index bd7bb02..c713162 100644
--- a/OvmfPkg/PlatformPei/MemDetect.c
+++ b/OvmfPkg/PlatformPei/MemDetect.c
@@ -36,6 +36,8 @@ Module Name:
 #include "Platform.h"
 #include "Cmos.h"

+UINT8 mPhysMemAddressWidth;
+
 UINT32
 GetSystemMemorySizeBelow4gb (
   VOID
@@ -84,6 +86,47 @@ GetSystemMemorySizeAbove4gb (
   return LShiftU64 (Size, 16);
 }

+
+/**
+  Initialize the mPhysMemAddressWidth variable, based on guest RAM size.
+**/
+VOID
+AddressWidthInitialization (
+  VOID
+  )
+{
+  UINT64 FirstNonAddress;
+
+  //
+  // As guest-physical memory size grows, the permanent PEI RAM requirements
+  // are dominated by the identity-mapping page tables built by the DXE IPL.
+  // The DXE IPL keys off the physical address bits advertised in the CPU HOB.
+  // To conserve memory, we calculate the minimum address width here.
+  //
+  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
+  mPhysMemAddressWidth = (UINT8)HighBitSet64 (FirstNonAddress);
+
+  //
+  // If FirstNonAddress is not an integral power of two, then we need an
+  // additional bit.
+  //
+  if ((FirstNonAddress & (FirstNonAddress - 1)) != 0) {
+    ++mPhysMemAddressWidth;
+  }
+
+  //
+  // The minimum address width is 36 (covers up to and excluding 64 GB, which
+  // is the maximum for Ia32 + PAE). The theoretical architecture maximum for
+  // X64 long mode is 52 bits, but the DXE IPL clamps that down to 48 bits. We
+  // can simply assert that here, since 48 bits are good enough for 256 TB.
+  //
+  if (mPhysMemAddressWidth <= 36) {
+    mPhysMemAddressWidth = 36;
+  }
+  ASSERT (mPhysMemAddressWidth <= 48);
+}
+
+
 /**
   Publish PEI core memory

@@ -99,6 +142,7 @@ PublishPeiMemory (
   EFI_PHYSICAL_ADDRESS MemoryBase;
   UINT64 MemorySize;
   UINT64 LowerMemorySize;
+  UINT32 PeiMemoryCap;

   if (mBootMode == BOOT_ON_S3_RESUME) {
     MemoryBase = PcdGet32 (PcdS3AcpiReservedMemoryBase);
@@ -107,13 +151,24 @@ PublishPeiMemory (
     LowerMemorySize = GetSystemMemorySizeBelow4gb ();

     //
+    // For the minimum address width of 36, installing 64 MB as permanent PEI
+    // RAM is sufficient. For the maximum width, the DXE IPL needs a bit more
+    // than 1 GB for paging structures. Therefore we establish an exponential
+    // formula so that the 48-36+1=13 different widths map to permanent PEI RAM
+    // sizes in [64 MB, 2 GB], that is [1<<26, 1<<31]; 6 different powers.
+    //
+    PeiMemoryCap = SIZE_64MB << ((mPhysMemAddressWidth - 36) * 5 / 12);
+    DEBUG ((EFI_D_INFO, "%a: mPhysMemAddressWidth=%d PeiMemoryCap=%uMB\n",
+      __FUNCTION__, mPhysMemAddressWidth, PeiMemoryCap >> 20));
+
+    //
     // Determine the range of memory to use during PEI
     //
     MemoryBase = PcdGet32 (PcdOvmfDxeMemFvBase) + PcdGet32 (PcdOvmfDxeMemFvSize);
     MemorySize = LowerMemorySize - MemoryBase;
-    if (MemorySize > SIZE_64MB) {
-      MemoryBase = LowerMemorySize - SIZE_64MB;
-      MemorySize = SIZE_64MB;
+    if (MemorySize > PeiMemoryCap) {
+      MemoryBase = LowerMemorySize - PeiMemoryCap;
+      MemorySize = PeiMemoryCap;
     }
   }

diff --git a/OvmfPkg/PlatformPei/Platform.c b/OvmfPkg/PlatformPei/Platform.c
index 1126c65..9634ce0 100644
--- a/OvmfPkg/PlatformPei/Platform.c
+++ b/OvmfPkg/PlatformPei/Platform.c
@@ -390,6 +390,7 @@ InitializePlatform (
   }

   BootModeInitialization ();
+  AddressWidthInitialization ();

   PublishPeiMemory ();
--
1.8.3.1
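
Note: the width calculation and the PEI RAM cap formula from the patch above
can be tried outside the firmware. This is a minimal standalone sketch, not
the OVMF code: the lower-case function names are made up, high_bit_set64()
is a hand-written stand-in for HighBitSet64(), and the size of RAM above
4 GB is passed in as a parameter instead of being read from CMOS.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE_64MB 0x04000000u
#define BASE_4GB  0x100000000ULL

/* Index of the highest set bit (0..63), or -1 if value is zero. */
static int high_bit_set64(uint64_t value)
{
    int bit = -1;

    while (value != 0) {
        value >>= 1;
        bit++;
    }
    return bit;
}

/* Narrowest address width (clamped to at least 36) covering all RAM. */
unsigned phys_address_width(uint64_t upper_memory_size)
{
    uint64_t first_non_address = BASE_4GB + upper_memory_size;
    unsigned width = (unsigned)high_bit_set64(first_non_address);

    /* Not a power of two: one more bit is needed to cover the remainder. */
    if ((first_non_address & (first_non_address - 1)) != 0) {
        width++;
    }
    if (width < 36) {
        width = 36;
    }
    assert(width <= 48);
    return width;
}

/* Permanent PEI RAM cap in bytes: 64 MB at width 36, 2 GB at width 48. */
uint32_t pei_memory_cap(unsigned width)
{
    assert(width >= 36 && width <= 48);
    return SIZE_64MB << ((width - 36) * 5 / 12);
}
```

With 60 GB above 4 GB the first non-address is exactly 64 GB, so the width
stays at 36; with 61 GB above 4 GB it becomes 37. Because of the integer
division, widths 36 through 38 all map to a 64 MB cap.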



------------------------------------------------------------------------------
Laszlo Ersek
2015-06-08 21:46:49 UTC
Maoming reported that guest memory sizes equal to or larger than 64GB
were not correctly handled by OVMF.

Enabling the DEBUG_GCD (0x00100000) bit in PcdDebugPrintErrorLevel, and
starting QEMU with 64GB guest RAM size, I found the following error in the
log:

  GCD:AddMemorySpace(Base=0000000100000000,Length=0000000F40000000)
    GcdMemoryType = Reserved
    Capabilities = 030000000000000F
    Status = Unsupported

This message is emitted when the DXE core is initializing the memory space
map, processing the "above 4GB" memory resource descriptor HOB that was
created by OVMF's QemuInitializeRam() function (see "UpperMemorySize").

The DXE core's call chain fails in:

  CoreInternalAddMemorySpace() [MdeModulePkg/Core/Dxe/Gcd/Gcd.c]
    CoreConvertSpace()
      //
      // Search for the list of descriptors that cover the range BaseAddress
      // to BaseAddress+Length
      //
      CoreSearchGcdMapEntry()

CoreSearchGcdMapEntry() fails because the one entry (with type
"nonexistent") in the initial GCD memory space map is too small to cover
the range being added:

  GCD:Initial GCD Memory Space Map
  GCDMemType Range                             Capabilities     Attributes
  ========== ================================= ================ ================
  NonExist   0000000000000000-0000000FFFFFFFFF 0000000000000000 0000000000000000

The size of this initial entry is determined from the CPU HOB
(CoreInitializeGcdServices()).

Set the SizeOfMemorySpace field in the CPU HOB to mPhysMemAddressWidth,
which is the narrowest valid value to cover the entire guest RAM.

Reported-by: Maoming <***@huawei.com>
Cc: Maoming <***@huawei.com>
Cc: Huangpeng (Peter) <***@huawei.com>
Cc: Wei Liu <***@citrix.com>
Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformPei/Platform.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/OvmfPkg/PlatformPei/Platform.c b/OvmfPkg/PlatformPei/Platform.c
index 9634ce0..e8b4b88 100644
--- a/OvmfPkg/PlatformPei/Platform.c
+++ b/OvmfPkg/PlatformPei/Platform.c
@@ -241,9 +241,11 @@ MiscInitialization (
   IoOr8 (0x92, BIT1);

   //
-  // Build the CPU hob with 36-bit addressing and 16-bits of IO space.
+  // Build the CPU HOB with guest RAM size dependent address width and 16-bits
+  // of IO space. (Side note: unlike other HOBs, the CPU HOB is needed during
+  // S3 resume as well, so we build it unconditionally.)
   //
-  BuildCpuHob (36, 16);
+  BuildCpuHob (mPhysMemAddressWidth, 16);

   //
   // Query Host Bridge DID to determine platform type and save to PCD
--
1.8.3.1
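
Note: the failure described in the commit message above comes down to a
bounds check. The sketch below only tests whether a range lies inside
[0, 2^width); it is a simplification with a made-up function name, and
does not reproduce how the DXE core walks the GCD map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [base, base + length) lies entirely inside [0, 2^width). */
bool range_fits_width(uint64_t base, uint64_t length, unsigned width)
{
    uint64_t limit;

    assert(width >= 1 && width <= 63);
    limit = 1ULL << width;

    /* Written without computing base + length, so it cannot overflow. */
    return base <= limit && length <= limit - base;
}
```

The range from the log (Base=0x100000000, Length=0xF40000000) ends at
0x1040000000, which is above the 36-bit limit of 0x1000000000, so it does
not fit into a 36-bit map but does fit into a 37-bit one.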



------------------------------------------------------------------------------
Laszlo Ersek
2015-06-08 21:46:50 UTC
Build the memory HOBs in a tight block, in increasing base address order.

Cc: Maoming <***@huawei.com>
Cc: Huangpeng (Peter) <***@huawei.com>
Cc: Wei Liu <***@citrix.com>
Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformPei/MemDetect.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/OvmfPkg/PlatformPei/MemDetect.c b/OvmfPkg/PlatformPei/MemDetect.c
index c713162..3ceb142 100644
--- a/OvmfPkg/PlatformPei/MemDetect.c
+++ b/OvmfPkg/PlatformPei/MemDetect.c
@@ -207,8 +207,11 @@ QemuInitializeRam (
     //
     // Create memory HOBs
     //
-    AddMemoryRangeHob (BASE_1MB, LowerMemorySize);
     AddMemoryRangeHob (0, BASE_512KB + BASE_128KB);
+    AddMemoryRangeHob (BASE_1MB, LowerMemorySize);
+    if (UpperMemorySize != 0) {
+      AddUntestedMemoryBaseSizeHob (BASE_4GB, UpperMemorySize);
+    }
   }

   MtrrSetMemoryAttribute (BASE_1MB, LowerMemorySize - BASE_1MB, CacheWriteBack);
@@ -216,10 +219,6 @@ QemuInitializeRam (
   MtrrSetMemoryAttribute (0, BASE_512KB + BASE_128KB, CacheWriteBack);

   if (UpperMemorySize != 0) {
-    if (mBootMode != BOOT_ON_S3_RESUME) {
-      AddUntestedMemoryBaseSizeHob (BASE_4GB, UpperMemorySize);
-    }
-
     MtrrSetMemoryAttribute (BASE_4GB, UpperMemorySize, CacheWriteBack);
   }
 }
--
1.8.3.1



------------------------------------------------------------------------------
Laszlo Ersek
2015-06-08 21:46:51 UTC
At the moment we work with a UC default MTRR type, and set three memory
ranges to WB:
- [0, 640 KB),
- [1 MB, LowerMemorySize),
- [4 GB, 4 GB + UpperMemorySize).

Unfortunately, coverage for the third range can fail with a high
likelihood. If the alignment of the base (ie. 4 GB) and the alignment of
the size (UpperMemorySize) differ, then MtrrLib creates a series of
variable MTRR entries, with power-of-two sized MTRR masks. And, it's
really easy to run out of variable MTRR entries, dependent on the
alignment difference.

This is a problem because a Linux guest will loudly reject any high memory
that is not covered by MTRR.

So, let's follow the inverse pattern (loosely inspired by SeaBIOS):
- flip the MTRR default type to WB,
- set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default
type and variable MTRRs, so we can't avoid this,
- set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs,
- set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs
more likely than the other scheme (due to less chaotic alignment
differences).

Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000)
in PcdDebugPrintErrorLevel.

BUG: Although the MTRRs look good to me in the OVMF debug log, I still
can't boot >= 64 GB guests with this. Instead of the complaints mentioned
above, the Linux guest apparently spirals into an infinite loop (on KVM),
or hangs with no CPU load (on TCG).

Cc: Maoming <***@huawei.com>
Cc: Huangpeng (Peter) <***@huawei.com>
Cc: Wei Liu <***@citrix.com>
Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformPei/MemDetect.c | 43 +++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)
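
Note: to get a feel for why the WB range above 4 GB is expensive while the
UC hole below 4 GB is cheap, one can count how many naturally aligned
power-of-two blocks are needed to tile a range exactly. The greedy count
below is only an illustration under that assumption; the function name is
made up, and MtrrLib's actual algorithm may choose entries differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the naturally aligned power-of-two blocks needed to tile
 * [base, base + size) exactly, taking the largest usable block each step.
 */
unsigned aligned_block_count(uint64_t base, uint64_t size)
{
    unsigned count = 0;

    assert(size <= UINT64_MAX - base);
    while (size != 0) {
        /* Largest power of two that does not exceed the remaining size. */
        uint64_t block = 1;
        while (block <= size / 2) {
            block <<= 1;
        }

        /* A block cannot be larger than the alignment of its base. */
        if (base != 0) {
            uint64_t align = base & (~base + 1);
            if (block > align) {
                block = align;
            }
        }

        base += block;
        size -= block;
        count++;
    }
    return count;
}
```

By this count, [4 GB, 4 GB + 61 GB) needs five blocks and one extra
megabyte of RAM makes it six, whereas [3 GB, 4 GB) needs a single block.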

diff --git a/OvmfPkg/PlatformPei/MemDetect.c b/OvmfPkg/PlatformPei/MemDetect.c
index 3ceb142..cceab22 100644
--- a/OvmfPkg/PlatformPei/MemDetect.c
+++ b/OvmfPkg/PlatformPei/MemDetect.c
@@ -194,6 +194,8 @@ QemuInitializeRam (
 {
   UINT64 LowerMemorySize;
   UINT64 UpperMemorySize;
+  MTRR_SETTINGS MtrrSettings;
+  EFI_STATUS Status;

   DEBUG ((EFI_D_INFO, "%a called\n", __FUNCTION__));

@@ -214,12 +216,45 @@ QemuInitializeRam (
     }
   }

-  MtrrSetMemoryAttribute (BASE_1MB, LowerMemorySize - BASE_1MB, CacheWriteBack);
+  //
+  // We'd like to keep the following ranges uncached:
+  // - [640 KB, 1 MB)
+  // - [LowerMemorySize, 4 GB)
+  //
+  // Everything else should be WB. Unfortunately, programming the inverse (ie.
+  // keeping the default UC, and configuring the complement set of the above as
+  // WB) is not reliable in general, because the end of the upper RAM can have
+  // practically any alignment, and we may not have enough variable MTRRs to
+  // cover it exactly.
+  //
+  if (IsMtrrSupported ()) {
+    MtrrGetAllMtrrs (&MtrrSettings);

-  MtrrSetMemoryAttribute (0, BASE_512KB + BASE_128KB, CacheWriteBack);
+    //
+    // MTRRs disabled, fixed MTRRs disabled, default type is uncached
+    //
+    ASSERT ((MtrrSettings.MtrrDefType & BIT11) == 0);
+    ASSERT ((MtrrSettings.MtrrDefType & BIT10) == 0);
+    ASSERT ((MtrrSettings.MtrrDefType & 0xFF) == 0);

-  if (UpperMemorySize != 0) {
-    MtrrSetMemoryAttribute (BASE_4GB, UpperMemorySize, CacheWriteBack);
+    //
+    // flip default type to writeback
+    //
+    SetMem (&MtrrSettings.Fixed, sizeof MtrrSettings.Fixed, 0x06);
+    ZeroMem (&MtrrSettings.Variables, sizeof MtrrSettings.Variables);
+    MtrrSettings.MtrrDefType |= BIT11 | BIT10 | 6;
+    MtrrSetAllMtrrs (&MtrrSettings);
+
+    //
+    // punch holes
+    //
+    Status = MtrrSetMemoryAttribute (BASE_512KB + BASE_128KB,
+               SIZE_256KB + SIZE_128KB, CacheUncacheable);
+    ASSERT_EFI_ERROR (Status);
+
+    Status = MtrrSetMemoryAttribute (LowerMemorySize,
+               SIZE_4GB - LowerMemorySize, CacheUncacheable);
+    ASSERT_EFI_ERROR (Status);
   }
 }
--
1.8.3.1
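
Note: the magic numbers in the patch above (BIT11, BIT10, 6, 0x06) follow
the layout of the IA32_MTRR_DEF_TYPE register and of the fixed-range MTRR
registers. The sketch below spells them out on plain integers; the macro
and function names are made up for illustration and nothing here touches
real MSRs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MTRR_DEF_TYPE_E   (1u << 11)  /* MTRRs enabled */
#define MTRR_DEF_TYPE_FE  (1u << 10)  /* fixed-range MTRRs enabled */
#define MTRR_TYPE_MASK    0xFFu       /* default memory type field */
#define MTRR_TYPE_WB      6u          /* write-back */

/* Turn a "disabled, default UC" value into "enabled, default WB". */
uint64_t def_type_flip_to_wb(uint64_t def_type)
{
    /* Same precondition as the three ASSERTs in the patch. */
    assert((def_type &
            (MTRR_DEF_TYPE_E | MTRR_DEF_TYPE_FE | MTRR_TYPE_MASK)) == 0);
    return def_type | MTRR_DEF_TYPE_E | MTRR_DEF_TYPE_FE | MTRR_TYPE_WB;
}

/* One fixed-range MTRR register holds eight one-byte type fields. */
uint64_t fixed_mtrr_fill(uint8_t type)
{
    uint64_t value;

    memset(&value, type, sizeof value);
    return value;
}
```

Starting from a zero register value, the flip yields 0xC06, and filling a
fixed-range register with type 6 yields 0x0606060606060606, which is what
the SetMem() call with 0x06 produces for each fixed MTRR.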


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-09 02:15:08 UTC
Post by Laszlo Ersek
BUG: Although the MTRRs look good to me in the OVMF debug log, I still
can't boot >= 64 GB guests with this. Instead of the complaints mentioned
above, the Linux guest apparently spirals into an infinite loop (on KVM),
or hangs with no CPU load (on TCG).
No, actually there is no bug in this patch (so s/RFC/PATCH/). I did more
testing and these are the findings:
- I can reproduce the same issue on KVM with SeaBIOS guests.
- The exact symptoms are that as soon as the highest guest-phys address
is >= 64 GB, then the guest kernel doesn't boot. It gets stuck
somewhere after hitting Enter in grub.
- Normally 3 GB of the guest RAM is mapped under 4 GB in guest-phys
address space, then there's a 1 GB PCI hole, and the rest is above
4 GB. This means that a 63 GB guest can be started (because 63 - 3 + 4
== 64), but if you add just 1 MB more, it won't boot.
- (This was the big discovery:) I flipped the "ept" parameter of the
kvm_intel module on my host to N, and then things started to work. I
just booted a 128 GB Linux guest with this patchset. (I have 4 GB
RAM in my host, plus approx 250 GB swap.) The guest could see it all.
- The TCG boot didn't hang either; I just couldn't wait earlier for
network initialization to complete.

I'm CC'ing Paolo for help with the EPT question. Other than that, this
series is functional. (For QEMU/KVM at least; Xen will likely need more
fixes from others.)

Thanks
Laszlo
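
Note: the "63 - 3 + 4 == 64" arithmetic above can be written down as a small
helper. It assumes the layout described in the message (at most 3 GB of RAM
below 4 GB, the rest starting at 4 GB); the function name is made up, and
other machine types or options may split RAM differently.

```c
#include <assert.h>
#include <stdint.h>

#define GB (1ULL << 30)

/*
 * First guest-physical address above all RAM, assuming at most 3 GB of RAM
 * sits below 4 GB and the remainder is placed from 4 GB upwards.
 */
uint64_t guest_phys_top(uint64_t ram_size)
{
    uint64_t low = ram_size < 3 * GB ? ram_size : 3 * GB;
    uint64_t high = ram_size - low;

    assert(high <= UINT64_MAX - 4 * GB);
    return high == 0 ? low : 4 * GB + high;
}
```

A 63 GB guest ends exactly at 64 GB; one more megabyte pushes the top of
RAM past the 64 GB (36-bit) boundary.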
------------------------------------------------------------------------------
Laszlo Ersek
2015-06-10 13:03:05 UTC
We have a root cause, it seems. The issue is that the processor in my
laptop, on which I tested, has only 36 bits for physical addresses:

$ grep 'address sizes' /proc/cpuinfo
address sizes : 36 bits physical, 48 bits virtual
...

Which matches where the problem surfaces (64 GB guest-phys address
space) with hw-supported nested paging (EPT) enabled on the host.

In order to confirm this, a colleague of mine gave me access to a server
with 96 GB of RAM, and:

address sizes : 46 bits physical, 48 bits virtual

On this host I booted a 72 GB OVMF guest on QEMU/KVM, with EPT enabled,
and according to the guest dmesg, the guest saw it all.

Memory: 74160924K/75493820K available (7735K kernel code, 1149K
rwdata, 3340K rodata, 1500K init, 1524K bss, 1332896K reserved, 0K
cma-reserved)

Maoming: since you reported this issue, please confirm that the patch
series resolves it for you as well. In that case, I'll repost the series
with "PATCH" as subject-prefix instead of "RFC", and I'll drop the BUG
note from the last commit message.

Thanks
Laszlo
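
Note: the host-side check described above (physical address bits reported
in /proc/cpuinfo versus the top of guest-physical RAM) could be automated
along these lines. This is a sketch with made-up function names; it only
parses a line in the format shown in this message and does not query KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse an "address sizes" line as shown in /proc/cpuinfo on x86 Linux. */
bool parse_address_sizes(const char *line, unsigned *phys, unsigned *virt)
{
    assert(line != NULL && phys != NULL && virt != NULL);

    /* Whitespace in the format matches any run of blanks or tabs. */
    return sscanf(line, "address sizes : %u bits physical, %u bits virtual",
                  phys, virt) == 2;
}

/* True if all guest-physical addresses below top fit in phys_bits bits. */
bool guest_fits_host(uint64_t top, unsigned phys_bits)
{
    assert(phys_bits >= 1 && phys_bits <= 63);
    return top <= (1ULL << phys_bits);
}
```

With 36 physical bits, a guest whose RAM ends at 64 GB still fits, but the
128 GB guest (ending at 129 GB, assuming 3 GB of it sits below 4 GB) does
not; with 46 bits, the 72 GB guest fits comfortably.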
------------------------------------------------------------------------------
Maoming
2015-06-15 13:25:05 UTC
Hi,
Sorry for the late reply.
I tested the patch series with 64 GB and 80 GB guests.
Both work on Xen.

Here is what it looks like inside the VM (with 80 GB of memory):
             total       used       free     shared    buffers     cached
Mem:      81956412     654708   81301704          0      10528      42256
-/+ buffers/cache:     601924   81354488
Swap:      4186108          0    4186108

Thanks a lot for your nice work!
Maoming


-----Original Message-----
From: Laszlo Ersek [mailto:***@redhat.com]
Sent: June 10, 2015 21:03
To: Maoming
Cc: edk2-***@lists.sourceforge.net; Huangpeng (Peter); Wei Liu; Paolo Bonzini
Subject: Re: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
------------------------------------------------------------------------------
Laszlo Ersek
2015-06-15 14:07:34 UTC
Post by Maoming
Sorry for the late reply.
I tested the patch series using 64G and 80G.
Both of them are OK in XEN.
             total       used       free     shared    buffers     cached
Mem:      81956412     654708   81301704          0      10528      42256
-/+ buffers/cache:     601924   81354488
Swap:      4186108          0    4186108
Thanks a lot for your nice work!
Maoming
Thanks for reporting back!

Since you mentioned earlier that you encountered the problem on qemu/KVM
too -- can you please give that a whirl as well, with this patch series
in place?

Thank you
Laszlo
Post by Maoming
-----Original Message-----
Sent: June 10, 2015 21:03
To: Maoming
Subject: Re: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
Post by Laszlo Ersek
Post by Laszlo Ersek
At the moment we work with a UC default MTRR type, and set three
- [0, 640 KB),
- [1 MB, LowerMemorySize),
- [4 GB, 4 GB + UpperMemorySize).
Unfortunately, coverage for the third range can fail with a high
likelihood. If the alignment of the base (ie. 4 GB) and the alignment
of the size (UpperMemorySize) differ, then MtrrLib creates a series
of variable MTRR entries, with power-of-two sized MTRR masks. And,
it's really easy to run out of variable MTRR entries, dependent on
the alignment difference.
This is a problem because a Linux guest will loudly reject any high
memory that is not covered by MTRR.
- flip the MTRR default type to WB,
- set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default
type and variable MTRRs, so we can't avoid this,
- set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs,
- set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs
more likely than the other scheme (due to less chaotic alignment
differences).
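To make the alignment argument concrete, here is a rough sketch that counts how many naturally aligned power-of-two blocks are needed to cover a range, since each variable MTRR describes exactly one such block. This is a simple greedy decomposition written for illustration; it is not MtrrLib's actual algorithm, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that is <= x. x must be nonzero. */
uint64_t floor_pow2(uint64_t x)
{
  uint64_t p = 1;

  assert(x != 0);
  while (p <= x / 2) {
    p *= 2;
  }
  return p;
}

/* Number of naturally aligned power-of-two blocks needed to cover
 * [base, base + size) when walking upwards from base and always taking
 * the largest block that is both aligned at the current base and still
 * fits into the remaining size. */
unsigned count_mtrr_blocks(uint64_t base, uint64_t size)
{
  unsigned count = 0;

  while (size != 0) {
    uint64_t align = base & (~base + 1);  /* lowest set bit; 0 if base == 0 */
    uint64_t block = floor_pow2(size);

    if (align != 0 && align < block) {
      block = align;                      /* alignment of base is the limit */
    }
    base += block;
    size -= block;
    count++;
  }
  return count;
}
```

Taking example values in line with this thread (3 GB of RAM below 4 GB), a 64 GB guest has 61 GB above 4 GB. Under this greedy scheme, count_mtrr_blocks (4 GB, 61 GB) needs five blocks (4 + 8 + 16 + 32 + 1 GB), and one extra megabyte of RAM adds a sixth, while the hole [3 GB, 4 GB) needs a single block. The number of variable MTRRs actually available is small and is reported in the VCNT field of IA32_MTRRCAP.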
Effects of this patch can be observed by setting DEBUG_CACHE
(0x00200000) in PcdDebugPrintErrorLevel.
BUG: Although the MTRRs look good to me in the OVMF debug log, I
still can't boot >= 64 GB guests with this. Instead of the complaints
mentioned above, the Linux guest apparently spirals into an infinite
loop (on KVM), or hangs with no CPU load (on TCG).
No, actually there is no bug in this patch (so s/RFC/PATCH/). I did
- I can reproduce the same issue on KVM with SeaBIOS guests.
- The exact symptoms are that as soon as the highest guest-phys address
is >= 64 GB, then the guest kernel doesn't boot. It gets stuck
somewhere after hitting Enter in grub.
- Normally 3 GB of the guest RAM is mapped under 4 GB in guest-phys
address space, then there's a 1 GB PCI hole, and the rest is above
4 GB. This means that a 63 GB guest can be started (because 63 - 3 + 4
== 64), but if you add just 1 MB more, it won't boot.
- (This was the big discovery:) I flipped the "ept" parameter of the
kvm_intel module on my host to N, and then things started to work. I
just booted a 128 GB Linux guest with this patchset. (I have 4 GB
RAM in my host, plus approx 250 GB swap.) The guest could see it all.
- The TCG boot didn't hang either; I just couldn't wait earlier for
network initialization to complete.
I'm CC'ing Paolo for help with the EPT question. Other than that, this
series is functional. (For QEMU/KVM at least; Xen will likely need
more fixes from others.)
$ grep 'address sizes' /proc/cpuinfo
address sizes : 36 bits physical, 48 bits virtual
...
Which matches where the problem surfaces (64 GB guest-phys address
space) with hw-supported nested paging (EPT) enabled on the host.
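The arithmetic behind this can be sketched as follows, assuming the layout described above (up to 3 GB of RAM below 4 GB, a 1 GB PCI hole, the rest above 4 GB). The helper names are hypothetical, and real QEMU machine types may split RAM differently.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Number of bytes addressable with the given physical address width. */
uint64_t phys_addr_limit(unsigned bits)
{
  assert(bits < 64);
  return 1ULL << bits;
}

/* Exclusive end of guest RAM in guest-physical address space, assuming
 * at most 3 GB is placed below 4 GB and the remainder starts at 4 GB. */
uint64_t guest_ram_end(uint64_t ram_bytes)
{
  const uint64_t low_max = 3 * GIB;

  if (ram_bytes <= low_max) {
    return ram_bytes;
  }
  return 4 * GIB + (ram_bytes - low_max);
}

/* True if all guest RAM is reachable with the host's physical width. */
int guest_fits_host(uint64_t ram_bytes, unsigned host_phys_bits)
{
  return guest_ram_end(ram_bytes) <= phys_addr_limit(host_phys_bits);
}
```

With 36 physical address bits the limit is 64 GB, so a 63 GB guest (ending exactly at 64 GB) fits, and 63 GB plus 1 MB does not. With 46 bits the limit is 64 TB.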
address sizes : 46 bits physical, 48 bits virtual
On this host I booted a 72 GB OVMF guest on QEMU/KVM, with EPT enabled, and according to the guest dmesg, the guest saw it all.
Memory: 74160924K/75493820K available (7735K kernel code, 1149K
rwdata, 3340K rodata, 1500K init, 1524K bss, 1332896K reserved, 0K
cma-reserved)
Maoming: since you reported this issue, please confirm that the patch series resolves it for you as well. In that case, I'll repost the series with "PATCH" as subject-prefix instead of "RFC", and I'll drop the BUG note from the last commit message.
Thanks
Laszlo
Post by Laszlo Ersek
Post by Laszlo Ersek
Contributed-under: TianoCore Contribution Agreement 1.0
---
OvmfPkg/PlatformPei/MemDetect.c | 43 +++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)
diff --git a/OvmfPkg/PlatformPei/MemDetect.c b/OvmfPkg/PlatformPei/MemDetect.c
index 3ceb142..cceab22 100644
--- a/OvmfPkg/PlatformPei/MemDetect.c
+++ b/OvmfPkg/PlatformPei/MemDetect.c
@@ -194,6 +194,8 @@ QemuInitializeRam (
{
UINT64 LowerMemorySize;
UINT64 UpperMemorySize;
+ MTRR_SETTINGS MtrrSettings;
+ EFI_STATUS Status;
DEBUG ((EFI_D_INFO, "%a called\n", __FUNCTION__));
@@ -214,12 +216,45 @@ QemuInitializeRam (
}
}
- MtrrSetMemoryAttribute (BASE_1MB, LowerMemorySize - BASE_1MB, CacheWriteBack);
+ //
+ // - [640 KB, 1 MB)
+ // - [LowerMemorySize, 4 GB)
+ //
+ // Everything else should be WB. Unfortunately, programming the inverse (ie.
+ // keeping the default UC, and configuring the complement set of the above as
+ // WB) is not reliable in general, because the end of the upper RAM can have
+ // practically any alignment, and we may not have enough variable MTRRs to
+ // cover it exactly.
+ //
+ if (IsMtrrSupported ()) {
+ MtrrGetAllMtrrs (&MtrrSettings);
- MtrrSetMemoryAttribute (0, BASE_512KB + BASE_128KB, CacheWriteBack);
+ //
+ // MTRRs disabled, fixed MTRRs disabled, default type is uncached
+ //
+ ASSERT ((MtrrSettings.MtrrDefType & BIT11) == 0);
+ ASSERT ((MtrrSettings.MtrrDefType & BIT10) == 0);
+ ASSERT ((MtrrSettings.MtrrDefType & 0xFF) == 0);
- if (UpperMemorySize != 0) {
- MtrrSetMemoryAttribute (BASE_4GB, UpperMemorySize, CacheWriteBack);
+ //
+ // flip default type to writeback
+ //
+ SetMem (&MtrrSettings.Fixed, sizeof MtrrSettings.Fixed, 0x06);
+ ZeroMem (&MtrrSettings.Variables, sizeof MtrrSettings.Variables);
+ MtrrSettings.MtrrDefType |= BIT11 | BIT10 | 6;
+ MtrrSetAllMtrrs (&MtrrSettings);
+
+ //
+ // punch holes
+ //
+ Status = MtrrSetMemoryAttribute (BASE_512KB + BASE_128KB,
+ SIZE_256KB + SIZE_128KB, CacheUncacheable);
+ ASSERT_EFI_ERROR (Status);
+
+ Status = MtrrSetMemoryAttribute (LowerMemorySize,
+ SIZE_4GB - LowerMemorySize, CacheUncacheable);
+ ASSERT_EFI_ERROR (Status);
}
}
------------------------------------------------------------------------------
Maoming
2015-06-16 12:54:13 UTC
Permalink
-----Original Message-----
From: Laszlo Ersek [mailto:***@redhat.com]
Sent: June 15, 2015 22:08
To: Maoming
Cc: edk2-***@lists.sourceforge.net; Huangpeng (Peter); Wei Liu; Paolo Bonzini
Subject: Re: RE: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
Post by Maoming
Sorry for the late reply.
I tested the patch series using 64G and 80G.
Both of them are OK in XEN.
             total       used       free     shared    buffers     cached
Mem:      81956412     654708   81301704          0      10528      42256
-/+ buffers/cache:     601924   81354488
Swap:      4186108          0    4186108
Thanks a lot for your nice work!
Maoming
Thanks for reporting back!

Since you mentioned earlier that you encountered the problem on qemu/KVM
too -- can you please give that a whirl as well, with this patch series
in place?

Thank you
Laszlo


The patch series works well in KVM too.
My environment is:
version: kvm-kmod-3.6
QEMU emulator version 2.1.0

Here is what it looks like inside the VM (the memory is 90G):
             total       used       free     shared    buffers     cached
Mem:      92862616    1155156   91707460          0      13552      77952
-/+ buffers/cache:    1063652   91798964
Swap:      4063224          0    4063224

Thanks!
Maoming
------------------------------------------------------------------------------
Laszlo Ersek
2015-06-16 14:51:32 UTC
Permalink
Post by Maoming
-----Original Message-----
Sent: June 15, 2015 22:08
To: Maoming
Subject: Re: RE: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
Post by Maoming
Sorry for the late reply.
I tested the patch series using 64G and 80G.
Both of them are OK in XEN.
             total       used       free     shared    buffers     cached
Mem:      81956412     654708   81301704          0      10528      42256
-/+ buffers/cache:     601924   81354488
Swap:      4186108          0    4186108
Thanks a lot for your nice work!
Maoming
Thanks for reporting back!
Since you mentioned earlier that you encountered the problem on qemu/KVM
too -- can you please give that a whirl as well, with this patch series
in place?
Thank you
Laszlo
The patch series works well in KVM too.
version: kvm-kmod-3.6
QEMU emulator version 2.1.0
             total       used       free     shared    buffers     cached
Mem:      92862616    1155156   91707460          0      13552      77952
-/+ buffers/cache:    1063652   91798964
Swap:      4063224          0    4063224
Thanks!
Maoming
Great, thank you.

I'll add Wei Liu's Tested-by to patches #1 and #2 (because the other two
patches don't affect Xen), and I will add your Tested-by to all four
patches. I'll update the commit message of patch #4 and I'll resend the
series as PATCH, not RFC.

Cheers!
Laszlo
------------------------------------------------------------------------------
Wei Liu
2015-06-10 14:18:50 UTC
Permalink
Post by Laszlo Ersek
This is in response to
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15253
http://thread.gmane.org/gmane.comp.emulators.xen.devel/245623
It's an RFC only because the last patch doesn't actually work. Ideas
welcome.
Thanks
Laszlo
Thanks Laszlo.

I think only the first two patches matter to Xen so I only tested those.
Now I'm able to boot a guest with more than 64GB ram.

Here is what it looks like from a 128 GB guest.

***@debianhvm:~# free -m
             total       used       free     shared    buffers     cached
Mem:         12640        116      12523          0          6         27
-/+ buffers/cache:         82      12558
Swap:        26486          0      26486

Nice work!

Wei.
Post by Laszlo Ersek
OvmfPkg: PlatformPei: enable larger permanent PEI RAM
OvmfPkg: PlatformPei: create the CPU HOB with dynamic memory space
width
OvmfPkg: PlatformPei: beautify memory HOB order in QemuInitializeRam()
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
OvmfPkg/PlatformPei/Platform.h | 7 +++
OvmfPkg/PlatformPei/MemDetect.c | 115 +++++++++++++++++++++++++++++++++++-----
OvmfPkg/PlatformPei/Platform.c | 7 ++-
3 files changed, 114 insertions(+), 15 deletions(-)
--
1.8.3.1
------------------------------------------------------------------------------
Wei Liu
2015-06-10 15:06:52 UTC
Permalink
Post by Wei Liu
Post by Laszlo Ersek
This is in response to
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15253
http://thread.gmane.org/gmane.comp.emulators.xen.devel/245623
It's an RFC only because the last patch doesn't actually work. Ideas
welcome.
Thanks
Laszlo
Thanks Laszlo.
I think only the first two patches matter to Xen so I only tested those.
Now I'm able to boot a guest with more than 64GB ram.
Here is what it looks like from a 128 GB guest.
             total       used       free     shared    buffers     cached
Mem:         12640        116      12523          0          6         27
-/+ buffers/cache:         82      12558
Swap:        26486          0      26486
Spoke too soon and miscounted. It was only 12GB of ram instead of 128GB.
I need to investigate this a bit more.

But anyway, this is an improvement over the previous situation.

Wei.
Post by Wei Liu
Nice work!
Wei.
Post by Laszlo Ersek
OvmfPkg: PlatformPei: enable larger permanent PEI RAM
OvmfPkg: PlatformPei: create the CPU HOB with dynamic memory space
width
OvmfPkg: PlatformPei: beautify memory HOB order in QemuInitializeRam()
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
OvmfPkg/PlatformPei/Platform.h | 7 +++
OvmfPkg/PlatformPei/MemDetect.c | 115 +++++++++++++++++++++++++++++++++++-----
OvmfPkg/PlatformPei/Platform.c | 7 ++-
3 files changed, 114 insertions(+), 15 deletions(-)
--
1.8.3.1
------------------------------------------------------------------------------
Wei Liu
2015-06-10 15:20:45 UTC
Permalink
Post by Wei Liu
Post by Laszlo Ersek
This is in response to
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/15253
http://thread.gmane.org/gmane.comp.emulators.xen.devel/245623
It's an RFC only because the last patch doesn't actually work. Ideas
welcome.
Thanks
Laszlo
Thanks Laszlo.
I think only the first two patches matter to Xen so I only tested those.
Now I'm able to boot a guest with more than 64GB ram.
Here is what it looks like from a 128 GB guest.
             total       used       free     shared    buffers     cached
Mem:         12640        116      12523          0          6         27
-/+ buffers/cache:         82      12558
Swap:        26486          0      26486
The above output is wrong because I mistyped the guest memory size in the
configuration (131072 vs 13172).

Here is the real thing:

***@debianhvm:~# free -m
             total       used       free     shared    buffers     cached
Mem:        128928     122424       6503          0          7     121883
-/+ buffers/cache:        532     128395
Swap:        26486          0      26486
***@debianhvm:~#

Wei.
Post by Wei Liu
Nice work!
Wei.
Post by Laszlo Ersek
OvmfPkg: PlatformPei: enable larger permanent PEI RAM
OvmfPkg: PlatformPei: create the CPU HOB with dynamic memory space
width
OvmfPkg: PlatformPei: beautify memory HOB order in QemuInitializeRam()
OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam()
OvmfPkg/PlatformPei/Platform.h | 7 +++
OvmfPkg/PlatformPei/MemDetect.c | 115 +++++++++++++++++++++++++++++++++++-----
OvmfPkg/PlatformPei/Platform.c | 7 ++-
3 files changed, 114 insertions(+), 15 deletions(-)
--
1.8.3.1
------------------------------------------------------------------------------