Discussion:
[edk2] [PATCH v4 00/10] Support UTF-8 in .uni string files
Jordan Justen
2015-06-04 06:42:11 UTC
Permalink
https://github.com/jljusten/edk2.git utf8-v4

v4:
* Reject characters in the range 0xd800 - 0xdfff since they are
reserved for UTF-16 Surrogate Pairs. (lersek)
* Add more tests to verify the Surrogate Pairs range is rejected, and
that UTF-8 BOM is allowed. (lersek)

v3:
* v2 fixed the UCS-2 issue with UTF-16 files by 'accident'. Now this
is done in separate patches. (Patches 3 & 4)
* Validate entire file by loading the entire contents (mdkinney)
* Add stub version of ucs-2 codec to verify unicode file contents are
valid UCS-2 characters.

v2:
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)

The UTF-16 .uni files are fairly annoying to work with:
* They must be checked in as 'binary' files
* It is difficult to produce a diff of changes
* UTF-8 is more likely to be supported by text editors

This series allows .uni files to contain UTF-8 (or, as before, UTF-16)
string data. If the UTF-16 LE or BE BOM is found, then the file is
read as UTF-16. Otherwise, it is treated as UTF-8.
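The BOM-based detection described above can be sketched as a
standalone helper (the function name is hypothetical; the series
implements the same rule inside UniFileClassObject.OpenUniFile, and
the snippet uses Python 3 bytes):

```python
import codecs

def detect_uni_encoding(data):
    # A UTF-16 BOM (either endianness) selects UTF-16;
    # any other leading bytes mean the file is treated as UTF-8.
    if (data.startswith(codecs.BOM_UTF16_BE) or
        data.startswith(codecs.BOM_UTF16_LE)):
        return 'utf-16'
    return 'utf-8'

print(detect_uni_encoding(codecs.BOM_UTF16_LE + b'A\x00'))   # utf-16
print(detect_uni_encoding(b'#langdef en-US "English"'))      # utf-8
```

Note that a UTF-8 BOM falls through to the default branch, which is
why UTF-8-with-BOM files remain accepted.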

Jordan Justen (10):
BaseTools/Tests: Always add BaseTools source to import path
BaseTools/EdkLogger: Support unit tests with a SILENT log level
BaseTools/Tests: Add unit test for AutoGen.UniClassObject
BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni
files
BaseTools/Tests: Verify unsupported UTF-16 are rejected
BaseTools/UniClassObject: Support UTF-8 string data in .uni files
BaseTools/Tests: Verify 32-bit UTF-8 chars are rejected
BaseTools/Tests: Verify unsupported UTF-8 data is rejected
BaseTools/Tests: Verify supported UTF-8 data is allowed
OvmfPkg/PlatformDxe: Convert Platform.uni to UTF-8

BaseTools/Source/Python/AutoGen/UniClassObject.py | 91 ++++++++++-
BaseTools/Source/Python/Common/EdkLogger.py | 9 +-
BaseTools/Tests/CheckUnicodeSourceFiles.py | 181 ++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
BaseTools/Tests/RunTests.py | 2 -
BaseTools/Tests/TestTools.py | 9 +-
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
7 files changed, 289 insertions(+), 7 deletions(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:13 UTC
Permalink
This allows the unit tests to run without errors being logged to the
screen.
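The mechanism is simply a log level above every real one: Python's
logging machinery drops records below the logger's configured level,
so a SILENT value of 99 (above ERROR at 50) suppresses all output. A
minimal stdlib sketch of the idea (names are illustrative, not the
EdkLogger API):

```python
import logging

SILENT = 99  # higher than logging.CRITICAL (50), so all records are dropped

logger = logging.getLogger('tool_error')
logger.addHandler(logging.StreamHandler())

logger.setLevel(SILENT)
logger.error('suppressed: unit tests can exercise error paths quietly')

# Even ERROR/CRITICAL records are now filtered out
assert not logger.isEnabledFor(logging.CRITICAL)
```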

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Source/Python/Common/EdkLogger.py | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/EdkLogger.py b/BaseTools/Source/Python/Common/EdkLogger.py
index f048b61..8d6d426 100644
--- a/BaseTools/Source/Python/Common/EdkLogger.py
+++ b/BaseTools/Source/Python/Common/EdkLogger.py
@@ -32,6 +32,7 @@ INFO = 20
WARN = 30
QUIET = 40
ERROR = 50
+SILENT = 99

IsRaiseError = True

@@ -39,7 +40,9 @@ IsRaiseError = True
_ToolName = os.path.basename(sys.argv[0])

# For validation purpose
-_LogLevels = [DEBUG_0, DEBUG_1, DEBUG_2, DEBUG_3, DEBUG_4, DEBUG_5, DEBUG_6, DEBUG_7, DEBUG_8, DEBUG_9, VERBOSE, WARN, INFO, ERROR, QUIET]
+_LogLevels = [DEBUG_0, DEBUG_1, DEBUG_2, DEBUG_3, DEBUG_4, DEBUG_5,
+ DEBUG_6, DEBUG_7, DEBUG_8, DEBUG_9, VERBOSE, WARN, INFO,
+ ERROR, QUIET, SILENT]

# For DEBUG level (All DEBUG_0~9 are applicable)
_DebugLogger = logging.getLogger("tool_debug")
@@ -235,6 +238,10 @@ def SetLevel(Level):
_InfoLogger.setLevel(Level)
_ErrorLogger.setLevel(Level)

+def InitializeForUnitTest():
+ Initialize()
+ SetLevel(SILENT)
+
## Get current log level
def GetLevel():
return _InfoLogger.getEffectiveLevel()
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:12 UTC
Permalink
This allows unit tests to import BaseTools Python modules directly,
which is very useful when writing unit tests.

Previously, this path setup was done when RunTests.py was executed,
so unit tests could import BaseTools modules only when they were run
via RunTests.

This change allows running the unit test files individually, which
can be faster when developing new unit test cases.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/RunTests.py | 2 --
BaseTools/Tests/TestTools.py | 9 ++++++++-
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/BaseTools/Tests/RunTests.py b/BaseTools/Tests/RunTests.py
index e8ca2d0..0dd6563 100644
--- a/BaseTools/Tests/RunTests.py
+++ b/BaseTools/Tests/RunTests.py
@@ -21,8 +21,6 @@ import unittest

import TestTools

-sys.path.append(TestTools.PythonSourceDir)
-
def GetCTestSuite():
import CToolsTests
return CToolsTests.TheTestSuite()
diff --git a/BaseTools/Tests/TestTools.py b/BaseTools/Tests/TestTools.py
index ac009db..27afd79 100644
--- a/BaseTools/Tests/TestTools.py
+++ b/BaseTools/Tests/TestTools.py
@@ -1,7 +1,7 @@
## @file
# Utility functions and classes for BaseTools unit tests
#
-# Copyright (c) 2008 - 2012, Intel Corporation. All rights reserved.<BR>
+# Copyright (c) 2008 - 2015, Intel Corporation. All rights reserved.<BR>
#
# This program and the accompanying materials
# are licensed and made available under the terms and conditions of the BSD License
@@ -31,6 +31,13 @@ CSourceDir = os.path.join(BaseToolsDir, 'Source', 'C')
PythonSourceDir = os.path.join(BaseToolsDir, 'Source', 'Python')
TestTempDir = os.path.join(TestsDir, 'TestTempDir')

+if PythonSourceDir not in sys.path:
+ #
+ # Allow unit tests to import BaseTools python modules. This is very useful
+ # for writing unit tests.
+ #
+ sys.path.append(PythonSourceDir)
+
def MakeTheTestSuite(localItems):
tests = []
for name, item in localItems.iteritems():
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:14 UTC
Permalink
This verifies that a .uni file containing UTF-16 data (with a BOM) is
successfully read.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 88 ++++++++++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
2 files changed, 91 insertions(+), 1 deletion(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
new file mode 100644
index 0000000..0083ad8
--- /dev/null
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -0,0 +1,88 @@
+## @file
+# Unit tests for AutoGen.UniClassObject
+#
+# Copyright (c) 2015, Intel Corporation. All rights reserved.<BR>
+#
+# This program and the accompanying materials
+# are licensed and made available under the terms and conditions of the BSD License
+# which accompanies this distribution. The full text of the license may be found at
+# http://opensource.org/licenses/bsd-license.php
+#
+# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+# WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+#
+
+##
+# Import Modules
+#
+import os
+import unittest
+
+import codecs
+
+import TestTools
+
+from Common.Misc import PathClass
+import AutoGen.UniClassObject as BtUni
+
+from Common import EdkLogger
+EdkLogger.InitializeForUnitTest()
+
+class Tests(TestTools.BaseToolsTest):
+
+ SampleData = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "STR_A for en-US"
+ '''
+
+ def EncodeToFile(self, encoding, string=None):
+ if string is None:
+ string = self.SampleData
+ data = codecs.encode(string, encoding)
+ path = 'input.uni'
+ self.WriteTmpFile(path, data)
+ return PathClass(self.GetTmpFilePath(path))
+
+ def ErrorFailure(self, error, encoding, shouldPass):
+ msg = error + ' should '
+ if shouldPass:
+ msg += 'not '
+ msg += 'be generated for '
+ msg += '%s data in a .uni file' % encoding
+ self.fail(msg)
+
+ def UnicodeErrorFailure(self, encoding, shouldPass):
+ self.ErrorFailure('UnicodeError', encoding, shouldPass)
+
+ def EdkErrorFailure(self, encoding, shouldPass):
+ self.ErrorFailure('EdkLogger.FatalError', encoding, shouldPass)
+
+ def CheckFile(self, encoding, shouldPass, string=None):
+ path = self.EncodeToFile(encoding, string)
+ try:
+ BtUni.UniFileClassObject([path])
+ if shouldPass:
+ return
+ except UnicodeError:
+ if not shouldPass:
+ return
+ else:
+ self.UnicodeErrorFailure(encoding, shouldPass)
+ except EdkLogger.FatalError:
+ if not shouldPass:
+ return
+ else:
+ self.EdkErrorFailure(encoding, shouldPass)
+ except Exception:
+ pass
+
+ self.EdkErrorFailure(encoding, shouldPass)
+
+ def testUtf16InUniFile(self):
+ self.CheckFile('utf_16', shouldPass=True)
+
+TheTestSuite = TestTools.MakeTheTestSuite(locals())
+
+if __name__ == '__main__':
+ allTests = TheTestSuite()
+ unittest.TextTestRunner().run(allTests)
diff --git a/BaseTools/Tests/PythonToolsTests.py b/BaseTools/Tests/PythonToolsTests.py
index 6096e21..c953daf 100644
--- a/BaseTools/Tests/PythonToolsTests.py
+++ b/BaseTools/Tests/PythonToolsTests.py
@@ -1,7 +1,7 @@
## @file
# Unit tests for Python based BaseTools
#
-# Copyright (c) 2008, Intel Corporation. All rights reserved.<BR>
+# Copyright (c) 2008 - 2015, Intel Corporation. All rights reserved.<BR>
#
# This program and the accompanying materials
# are licensed and made available under the terms and conditions of the BSD License
@@ -24,6 +24,8 @@ def TheTestSuite():
suites = []
import CheckPythonSyntax
suites.append(CheckPythonSyntax.TheTestSuite())
+ import CheckUnicodeSourceFiles
+ suites.append(CheckUnicodeSourceFiles.TheTestSuite())
return unittest.TestSuite(suites)

if __name__ == '__main__':
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:15 UTC
Permalink
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This means that our current BaseTools code could be allowing
unsupported UTF-16 characters to be used. To fix this, we decode the
file using Python's utf-16 decode support. Then we verify that each
character's code point is 0xffff or less.
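Condensed, the validation amounts to the following (hypothetical
helper name; the patch implements it as the encode() method of a
partial ucs-2 codec registered with the codecs module):

```python
def check_ucs2(text):
    # Every code point must fit in 16 bits and must not fall in the
    # 0xD800-0xDFFF range reserved for UTF-16 surrogate pairs.
    for ch in text:
        cp = ord(ch)
        if 0xd800 <= cp <= 0xdfff:
            raise ValueError('code point reserved for UTF-16 surrogate pairs')
        if cp > 0xffff:
            raise ValueError('code point too large to encode in UCS-2')

check_ucs2(u'STR_A for en-US')   # valid UCS-2, passes silently
```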

v3:
* Based on Mike Kinney's feedback, we now read the whole file and
verify up-front that it contains valid UCS-2 characters. Thanks
also to Laszlo Ersek for pointing out the Supplementary Plane
characters.

v4:
* Reject code points in 0xd800-0xdfff range since they are reserved
for UTF-16 surrogate pairs. (lersek)

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 88 ++++++++++++++++++++++-
1 file changed, 86 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..386e1ec 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@
import Common.LongFilePathOs as os, codecs, re
import distutils.util
import Common.EdkLogger as EdkLogger
+import StringIO
from Common.BuildToolError import *
from Common.String import GetLineNo
from Common.Misc import PathClass
@@ -147,6 +148,37 @@ def GetLanguageCode(LangName, IsCompatibleMode, File):

EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)

+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+class Ucs2Codec(codecs.Codec):
+ def __init__(self):
+ self.__utf16 = codecs.lookup('utf-16')
+
+ def encode(self, input, errors='strict'):
+ for Char in input:
+ CodePoint = ord(Char)
+ if CodePoint >= 0xd800 and CodePoint <= 0xdfff:
+ raise ValueError("Code Point is in range reserved for " +
+ "UTF-16 surrogate pairs")
+ elif CodePoint > 0xffff:
+ raise ValueError("Code Point too large to encode in UCS-2")
+ return self.__utf16.encode(input)
+
+TheUcs2Codec = Ucs2Codec()
+def Ucs2Search(name):
+ if name == 'ucs-2':
+ return codecs.CodecInfo(
+ name=name,
+ encode=TheUcs2Codec.encode,
+ decode=TheUcs2Codec.decode)
+ else:
+ return None
+codecs.register(Ucs2Search)
+
## StringDefClassObject
#
# A structure for language definition
@@ -209,7 +241,7 @@ class UniFileClassObject(object):
Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
if len(Lang) != 3:
try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
except:
@@ -253,6 +285,58 @@ class UniFileClassObject(object):
self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
return True

+ def OpenUniFile(self, FileName):
+ #
+ # Read file
+ #
+ try:
+ UniFile = open(FileName, mode='rb')
+ FileIn = UniFile.read()
+ UniFile.close()
+ except:
+ EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
+
+ #
+ # We currently only support UTF-16
+ #
+ Encoding = 'utf-16'
+
+ self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+ def VerifyUcs2Data(self, FileIn, FileName, Encoding):
+ Ucs2Info = codecs.lookup('ucs-2')
+ #
+ # Convert to unicode
+ #
+ try:
+ FileDecoded = codecs.decode(FileIn, Encoding)
+ Ucs2Info.encode(FileDecoded)
+ except:
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+ LineNumber = 0
+ ErrMsg = lambda Encoding, LineNumber: \
+ '%s contains invalid %s characters on line %d.' % \
+ (FileName, Encoding, LineNumber)
+ while True:
+ LineNumber = LineNumber + 1
+ try:
+ Line = File.readline()
+ if Line == '':
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg(Encoding, LineNumber))
+ Ucs2Info.encode(Line)
+ except:
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg('UCS-2', LineNumber))
+
#
# Get String name and value
#
@@ -305,7 +389,7 @@ class UniFileClassObject(object):
EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)

try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
except:
--
2.1.4


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-04 12:44:49 UTC
Permalink
Reviewed-by: Laszlo Ersek <***@redhat.com>

------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:17 UTC
Permalink
This allows .uni input files to be encoded with UTF-8. Today, we only
support UTF-16 encoding.

The strings are still converted to UCS-2 data for use in EDK II
modules. (This is the only Unicode character format supported by UEFI
and EDK II.)

Although UTF-8 would allow any UCS-4 character to be present in the
source file, we restrict the entire file (including comments) to the
UCS-2 range. This allows the files to be converted to UTF-16 if
needed.
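One payoff of keeping the whole file UCS-2-clean is that a UTF-8 .uni
file can be mechanically re-encoded as UTF-16 with no loss. A sketch
of such a conversion (hypothetical helper, Python 3; not part of the
patch):

```python
import codecs

def uni_utf8_to_utf16(data):
    # 'utf-8-sig' transparently consumes an optional UTF-8 BOM; the
    # text is then written back as UTF-16 LE with an explicit BOM,
    # matching the format the build tools already accept.
    text = data.decode('utf-8-sig')
    return codecs.BOM_UTF16_LE + text.encode('utf-16-le')
```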

v2:
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
* Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
commit

v3:
* Restrict the entire file's characters (including comments) to the
UCS-2 range in addition to string data. (mdkinney)

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index 386e1ec..1578205 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -297,9 +297,12 @@ class UniFileClassObject(object):
EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)

#
- # We currently only support UTF-16
+ # Detect Byte Order Mark at beginning of file. Default to UTF-8
#
- Encoding = 'utf-16'
+ Encoding = 'utf-8'
+ if (FileIn.startswith(codecs.BOM_UTF16_BE) or
+ FileIn.startswith(codecs.BOM_UTF16_LE)):
+ Encoding = 'utf-16'

self.VerifyUcs2Data(FileIn, FileName, Encoding)
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:16 UTC
Permalink
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This test checks that BaseTools rejects these characters in UTF-16
files.

The range 0xd800 - 0xdfff should also be rejected as Unicode code
points, because it is reserved for surrogate pair usage in UTF-16
files.

The cases covered by this test were fixed by the previous commit:
"BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files"

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 35 +++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 0083ad8..ad5fd18 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -38,7 +38,10 @@ class Tests(TestTools.BaseToolsTest):
def EncodeToFile(self, encoding, string=None):
if string is None:
string = self.SampleData
- data = codecs.encode(string, encoding)
+ if encoding is not None:
+ data = codecs.encode(string, encoding)
+ else:
+ data = string
path = 'input.uni'
self.WriteTmpFile(path, data)
return PathClass(self.GetTmpFilePath(path))
@@ -81,6 +84,36 @@ class Tests(TestTools.BaseToolsTest):
def testUtf16InUniFile(self):
self.CheckFile('utf_16', shouldPass=True)

+ def testSupplementaryPlaneUnicodeCharInUtf16File(self):
+ #
+ # Supplementary Plane characters can exist in UTF-16 files,
+ # but they are not valid UCS-2 characters.
+ #
+ # This test makes sure that BaseTools rejects these characters
+ # if seen in a .uni file.
+ #
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_16', shouldPass=False, string=data)
+
+ def testSurrogatePairUnicodeCharInUtf16File(self):
+ #
+ # Surrogate Pair code points are used in UTF-16 files to
+ # encode the Supplementary Plane characters. But, a Surrogate
+ # Pair code point which is not followed by another Surrogate
+ # Pair code point might be interpreted as a single code point
+ # with the Surrogate Pair code point.
+ #
+ # This test makes sure that BaseTools rejects these characters
+ # if seen in a .uni file.
+ #
+ data = codecs.BOM_UTF16_LE + '//\x01\xd8 '
+
+ self.CheckFile(encoding=None, shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())

if __name__ == '__main__':
--
2.1.4


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-04 12:44:59 UTC
Permalink
Reviewed-by: Laszlo Ersek <***@redhat.com>

------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:18 UTC
Permalink
Since UTF-8 .uni files might contain strings with Unicode code
points larger than 16 bits, and UEFI only supports UCS-2 characters,
we need to make sure that BaseTools rejects these characters in UTF-8
.uni source files.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index ad5fd18..102dc3c 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -114,6 +114,31 @@ class Tests(TestTools.BaseToolsTest):

self.CheckFile(encoding=None, shouldPass=False, string=data)

+ def test32bitUnicodeCharInUtf8File(self):
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_16', shouldPass=False, string=data)
+
+ def test32bitUnicodeCharInUtf8File(self):
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_8', shouldPass=False, string=data)
+
+ def test32bitUnicodeCharInUtf8Comment(self):
+ data = u'''
+ // Even in comments, we reject non-UCS-2 chars: \U00010300
+ #langdef en-US "English"
+ #string STR_A #language en-US "A"
+ '''
+
+ self.CheckFile('utf_8', shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())

if __name__ == '__main__':
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:19 UTC
Permalink
Surrogate pair characters can be encoded in UTF-8 files, but they are
not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\ud801', 'utf-8')
'\xed\xa0\x81'

But the range 0xd800 - 0xdfff should be rejected as Unicode code
points, because it is reserved for surrogate pair usage in UTF-16
files.

We test that this case is rejected for UTF-8 with and without the
UTF-8 BOM.
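As an aside, the '\xed\xa0\x81' byte sequence above only decodes
under Python 2's utf-8 codec (which the BaseTools of this era use);
Python 3's strict UTF-8 decoder rejects lone surrogates at decode
time, which is exactly the behavior the series wants from the .uni
parser:

```python
data = b'\xed\xa0\x81'  # UTF-8-style encoding of the lone surrogate U+D801

try:
    data.decode('utf-8')   # strict (default) error handling
    raise AssertionError('lone surrogate unexpectedly accepted')
except UnicodeDecodeError:
    pass                   # rejected, as the .uni parser should do
```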

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 102dc3c..2eeb0f5 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -139,6 +139,30 @@ class Tests(TestTools.BaseToolsTest):

self.CheckFile('utf_8', shouldPass=False, string=data)

+ def testSurrogatePairUnicodeCharInUtf8File(self):
+ #
+ # Surrogate Pair code points are used in UTF-16 files to
+ # encode the Supplementary Plane characters. In UTF-8, it is
+ # trivial to encode these code points, but they are not valid
+ # code points for characters, since they are reserved for the
+ # UTF-16 Surrogate Pairs.
+ #
+ # This test makes sure that BaseTools rejects these characters
+ # if seen in a .uni file.
+ #
+ data = '\xed\xa0\x81'
+
+ self.CheckFile(encoding=None, shouldPass=False, string=data)
+
+ def testSurrogatePairUnicodeCharInUtf8FileWithBom(self):
+ #
+ # Same test as testSurrogatePairUnicodeCharInUtf8File, but add
+ # the UTF-8 BOM
+ #
+ data = codecs.BOM_UTF8 + '\xed\xa0\x81'
+
+ self.CheckFile(encoding=None, shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())

if __name__ == '__main__':
--
2.1.4


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-04 12:45:05 UTC
Permalink
Reviewed-by: Laszlo Ersek <***@redhat.com>

------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:20 UTC
Permalink
We test a simple case of UTF-8 with and without the UTF-8 BOM.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 2eeb0f5..6ae62f1 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -114,6 +114,17 @@ class Tests(TestTools.BaseToolsTest):

self.CheckFile(encoding=None, shouldPass=False, string=data)

+ def testValidUtf8File(self):
+ self.CheckFile(encoding='utf_8', shouldPass=True)
+
+ def testValidUtf8FileWithBom(self):
+ #
+ # Same test as testValidUtf8File, but add the UTF-8 BOM
+ #
+ data = codecs.BOM_UTF8 + codecs.encode(self.SampleData, 'utf_8')
+
+ self.CheckFile(encoding=None, shouldPass=True, string=data)
+
def test32bitUnicodeCharInUtf8File(self):
data = u'''
#langdef en-US "English"
--
2.1.4


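The dispatch described in the cover letter (a UTF-16 LE or BE BOM selects UTF-16; anything else, including a UTF-8 BOM, is treated as UTF-8) can be sketched as below. The helper name is hypothetical and not taken from UniClassObject.py:

```python
# Sketch of the BOM-based encoding dispatch described in the cover
# letter; detect_uni_encoding is a hypothetical illustration, not the
# actual UniClassObject code.
import codecs

def detect_uni_encoding(raw):
    # A UTF-16 LE/BE BOM selects UTF-16; everything else, including a
    # file that starts with the UTF-8 BOM, is treated as UTF-8.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # decodes UTF-8 and strips the leading BOM
    return 'utf-8'

print(detect_uni_encoding(codecs.BOM_UTF8 + b'#langdef en-US "English"'))
# utf-8-sig
```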
------------------------------------------------------------------------------
Laszlo Ersek
2015-06-04 12:45:10 UTC
Permalink
Post by Jordan Justen
We test a simple case of UTF-8 with and without the UTF-8 BOM.
Contributed-under: TianoCore Contribution Agreement 1.0
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 2eeb0f5..6ae62f1 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
self.CheckFile(encoding=None, shouldPass=False, string=data)
+ self.CheckFile(encoding='utf_8', shouldPass=True)
+
+ #
+ # Same test as testValidUtf8File, but add the UTF-8 BOM
+ #
+ data = codecs.BOM_UTF8 + codecs.encode(self.SampleData, 'utf_8')
+
+ self.CheckFile(encoding=None, shouldPass=True, string=data)
+
data = u'''
#langdef en-US "English"
Reviewed-by: Laszlo Ersek <***@redhat.com>

------------------------------------------------------------------------------
Jordan Justen
2015-06-04 06:42:21 UTC
Permalink
This command was used to convert the file:
iconv -f UTF-16 -t UTF-8 \
-o OvmfPkg/PlatformDxe/Platform.uni \
OvmfPkg/PlatformDxe/Platform.uni

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Reviewed-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/OvmfPkg/PlatformDxe/Platform.uni b/OvmfPkg/PlatformDxe/Platform.uni
index d8d5b0bb4de9c0632dc183369fa7d46f455e9143..6df865519f35e302430f1ad290729011fd4c9048 100644
GIT binary patch
literal 1648
zcmai!QE$^Q5Xaw<_ztJ-1=yAX;)My2($-@UYg#2KjHk*?E{UbYj%=sf`1G9JEU<Qh
zN>!Wq?&rV%zPrf;-khDid@-50FU(Z;phZ<%cr|+***@87=ra1IF;aLw&GL^2N!qjDGZ
z_M=<0*igRil;&0_89>-H;9&+d8q_;1f=|=r%eY%s3j>{2mF6vQS%9q(c%G}a<MKhs
z3R-Sa3*H#u8le$6N<(s7Y|9G@-f_|JZG&D{FALNjLRl^4P*|>HA)Foqs`P8qbPhLr
z65Q1yug5I~8j29c!wO-n7TbP*mW-***@Jsrs6y?rDNdPvFxY-wGQ0N~cA*VcBIlZom`
zvFde>fzs7v$S{+wDK3VGpsTw-mRvJfCCjf#xPT~yd6Z^JG+k$G4(oW%638gPpCFpC
zIAySmA&lW9Oey>XrePYT=U%{%D7#*+Gx!lENf>7lOJSKn!d3}OS)7Ggw2bN16{Y`#
zZ&5ry2SzEh1-***@IF5H8n#p)(vbAR>z#X=Q*gAnr;FGt}3tA^WB={D%47+;55a*^lu
***@4%FNrMoS#6mqy4${X{qh+%?VsYl4g#TzP5di;Fqeoh-ME6N6x7wZ&Gn5-IM>Jz`_
zE{))1+vaMSEK-(***@FnoM+&ntY)UVdL(jeAo8%TiSRzJ!T*`V8-y-K-vQaKxL{
zqz+-nwNSkQkM9O+***@W7x&***@84pN(4LhJhyw!M*k-m)MqU2T5n-jVM6IcX
z5;Usu#Z46pv8(E-QucZ3=HEwl&C!iy>h(wR&~***@2Js?bC=erZ7oQ*<Cb;|AhOW<
znL|C)Q-9N_p>n_-N)U3#8&SZk8fA_kDyc~PYq-;t=***@6sW*;ixI!@&{k(k%x<$M
zt{j;%+}N|;xWfFGIgb9sog?~B)kDrDm=1GMJ^-hq-wk&8L%A****@yQDXqkTq<u1
zG%HKd&}4H1u;N0}yZ>8|hTqwS-9}G5I1)KiZ7VG7o!Adfb}tUAfu_+cOy*B<N36Lp
MIoHK=u$M1>1Ja-Rz5oCK

literal 3298
zcmb`J+fP$L5XR@(#Q)(0Uc3Qne37ULm!geA2`ws5O<UU1q!-dtK>Xv?-#6Pmrx)l^
zVw#@4&CYx~^Udu3{<***@0U&hN23VziaDiS60QF74LF*0Zi%*aX=p%s!=kZ7=PW
zy|EYcPpoa{w4bbjjAqI<w3cla>S|@WUfSDi=LCIyw8;J#1o}#IkKPciiS4j1i5yw#
zqqe2&9ow}d+O&7J3%U;(j(EnzYMiHsbb|***@68z_@`8oo0eW+s7@=GO_`ZTDxW6
z5cz}|p_08Gy}oN}Fw(9*b1iy9M<jhwuXdeJHDA;3A=|Jf-*$#Gu`5R*-8qtAYcRDF
zPzsrPz05y4)5tnA`*y8r`;5QLVM^@AebC}7bn~a|fkv9-1^FrWoNT4c>(otf&c;Pv
z)#****@Jt4`dcG&HE?#@)oJ)bpL4T0U?{rSQiT?L}J^yDt-nZOMdJg-{kaT;<L37
zgOr&j$***@zKVPz977yRQHV=I<?vJ$9{VNu0C^4+mR#$`O3;8V3VX3OyGlwQgGd}>Fu
zCFXMtc?`<%x`ag4HCO0~&-$*>O0Tr8(eC<KqZb*r#J;gETot#***@eY=R?1=d?TBW1n
z)=wA-yXIuGhLmJvHAo#***@GpIH%dWpHU~C7hyU-!nZ?d+wz&J~@V6vb{f(^|{0$
zF*$cmh#***@s;vs2}g3G2tiM^W#1BnN0zHuUb0-tmZvi{kI(UX4}O#Mh9w%D#?|
z=CdZ)Teg2N#gT&Knw{f2kZi!JXVdB#PV?***@T~4EQKas2l+ljMwg9yr-YxqrT%83*v
zuFdR=nHG=-w$<&x<9p~)ty$e;S|bAD>ijRae1(s?*j=<;v9H8WJXrL5sK7(_RoC+?
zufhT2dR7Tm<5L!WLH|59jf(N@;Str&fL#?+kyO5??9b6(F0dwHy-nOzpR`-iTaEgV
zOi%{o{=Lt$#i(;!)ddu*F@#*LQzK42gO@!PXNoZ#<0&y+w}^VWg>+Y^c(0HDV&x7s
zR(Tm~^)3=4*8c@--D~ATqa1Gz-NuWUHM5L){*L>|zJ}Lv-MeCjUJtr`byl%n+)bE?
zXKYT-sP{zK*9I^pzH`Doq`DJWq?)LEMcJd*#gLwOB|a)|<=ZEI`|0H)d(t)***@UNX
zOgHSGz{31SzxJJNq!gKOx1`MdL_-l<BU6sZt?yKT$cyE+m?`r)_V>Jr74xCYiP-Pw
zPBHc~ym!UvTSP)pN&QXG!iyc=YKnV^`VH+YyA;C^sd$R`;w$EoUO8f=Vi}nms<${6
z-j%vx`oud(h8JKLG-8da`u|};71n%(0d;4AUIxqY4>QG{RK^VW=~h(!<aJm16yL8U
k-gA;zT^qvNXb*HJya`sJE5@~tz0~77_B{JLWV(0%07EhSy8r+H
--
2.1.4


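The iconv invocation quoted above has a hypothetical Python equivalent, useful where iconv is unavailable (the helper name is illustrative, not from BaseTools):

```python
# Hypothetical Python equivalent of the iconv command quoted above:
# read a UTF-16 .uni file (its BOM selects the byte order) and rewrite
# the same file as UTF-8 without a BOM.
def convert_uni_to_utf8(path):
    with open(path, 'rb') as f:
        # The 'utf-16' codec consumes the LE/BE BOM automatically.
        text = f.read().decode('utf-16')
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))
```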
------------------------------------------------------------------------------
Jordan Justen
2015-06-12 07:11:14 UTC
Permalink
Dennis,

Are you okay if I commit these patches to BaseTools with Mike and
Laszlo's r-bs?

Thanks,

-Jordan
Post by Jordan Justen
https://github.com/jljusten/edk2.git utf8-v4
* Reject characters in the range 0xd800 - 0xdfff since they are
reserved for UTF-16 Surrogate Pairs. (lersek)
* Add more tests to verify the Surrogate Pairs range is rejected, and
that UTF-8 BOM is allowed. (lersek)
* v2 fixed the UCS-2 issue with UTF-16 files by 'accident'. Now this
is done in separate patches. (Patches 3 & 4)
* Validate entire file by loading the entire contents (mdkinney)
* Add stub version of ucs-2 codec to verify Unicode file contents are
valid UCS-2 characters.
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
* They must be checked in as 'binary' files
* It is difficult to produce a diff of changes
* UTF-8 is more likely to be supported by text editors
This series allows .uni files to contain UTF-8 (or, as before, UTF-16)
string data. If the UTF-16 LE or BE BOM is found, then the file is
read as UTF-16. Otherwise, it is treated as UTF-8.
BaseTools/Tests: Always add BaseTools source to import path
BaseTools/EdkLogger: Support unit tests with a SILENT log level
BaseTools/Tests: Add unit test for AutoGen.UniClassObject
BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni
files
BaseTools/Tests: Verify unsupported UTF-16 are rejected
BaseTools/UniClassObject: Support UTF-8 string data in .uni files
BaseTools/Tests: Verify 32-bit UTF-8 chars are rejected
BaseTools/Tests: Verify unsupported UTF-8 data is rejected
BaseTools/Tests: Verify supported UTF-8 data is allowed
OvmfPkg/PlatformDxe: Convert Platform.uni to UTF-8
BaseTools/Source/Python/AutoGen/UniClassObject.py | 91 ++++++++++-
BaseTools/Source/Python/Common/EdkLogger.py | 9 +-
BaseTools/Tests/CheckUnicodeSourceFiles.py | 181 ++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
BaseTools/Tests/RunTests.py | 2 -
BaseTools/Tests/TestTools.py | 9 +-
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
7 files changed, 289 insertions(+), 7 deletions(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py
--
2.1.4
------------------------------------------------------------------------------
Liu, Yingke D
2015-06-17 09:53:32 UTC
Permalink
Jordan,

I'm OK with the patches. Thanks for your contribution.

Reviewed-by: Yingke Liu <***@intel.com>

Dennis

-----Original Message-----
From: Justen, Jordan L
Sent: Friday, June 12, 2015 15:11
To: Liu, Yingke D; edk2-***@lists.sourceforge.net
Subject: Re: [PATCH v4 00/10] Support UTF-8 in .uni string files

Dennis,

Are you okay if I commit these patches to BaseTools with Mike and Laszlo's r-bs?

Thanks,

-Jordan
Post by Jordan Justen
https://github.com/jljusten/edk2.git utf8-v4
* Reject characters in the range 0xd800 - 0xdfff since they are
reserved for UTF-16 Surrogate Pairs. (lersek)
* Add more tests to verify the Surrogate Pairs range is rejected, and
that UTF-8 BOM is allowed. (lersek)
* v2 fixed the UCS-2 issue with UTF-16 files by 'accident'. Now this
is done in separate patches. (Patches 3 & 4)
* Validate entire file by loading the entire contents (mdkinney)
* Add stub version of ucs-2 codec to verify Unicode file contents are
valid UCS-2 characters.
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
* They must be checked in as 'binary' files
* It is difficult to produce a diff of changes
* UTF-8 is more likely to be supported by text editors
This series allows .uni files to contain UTF-8 (or, as before, UTF-16)
string data. If the UTF-16 LE or BE BOM is found, then the file is
read as UTF-16. Otherwise, it is treated as UTF-8.
BaseTools/Tests: Always add BaseTools source to import path
BaseTools/EdkLogger: Support unit tests with a SILENT log level
BaseTools/Tests: Add unit test for AutoGen.UniClassObject
BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni
files
BaseTools/Tests: Verify unsupported UTF-16 are rejected
BaseTools/UniClassObject: Support UTF-8 string data in .uni files
BaseTools/Tests: Verify 32-bit UTF-8 chars are rejected
BaseTools/Tests: Verify unsupported UTF-8 data is rejected
BaseTools/Tests: Verify supported UTF-8 data is allowed
OvmfPkg/PlatformDxe: Convert Platform.uni to UTF-8
BaseTools/Source/Python/AutoGen/UniClassObject.py | 91 ++++++++++-
BaseTools/Source/Python/Common/EdkLogger.py | 9 +-
BaseTools/Tests/CheckUnicodeSourceFiles.py | 181 ++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
BaseTools/Tests/RunTests.py | 2 -
BaseTools/Tests/TestTools.py | 9 +-
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
7 files changed, 289 insertions(+), 7 deletions(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py
--
2.1.4
------------------------------------------------------------------------------