Discussion:
[edk2] [PATCH v3 0/8] Support UTF-8 in .uni string files
Jordan Justen
2015-06-01 07:31:33 UTC
Permalink
https://github.com/jljusten/edk2.git utf8-v3

v3:
* v2 fixed the UCS-2 issue with UTF-16 files by 'accident'. Now this
is done in separate patches. (Patches 3 & 4)
* Validate the entire file by loading the entire contents (mdkinney)
* Add a stub version of a ucs-2 codec to verify that unicode file
contents are valid UCS-2 characters.

v2:
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)

The UTF-16 .uni files are fairly annoying to work with:
* They must be checked in as 'binary' files
* It is difficult to produce a diff of changes
* UTF-8 is more likely to be supported by text editors

This series allows .uni files to contain UTF-8 (or, as before, UTF-16)
string data. If the UTF-16 LE or BE BOM is found, then the file is
read as UTF-16. Otherwise, it is treated as UTF-8.
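The encoding-selection rule described above can be sketched as follows. This is only an illustration of the BOM check, not the code from the series itself; the function name is invented here:

```python
import codecs

# Sketch of the rule described above: treat the file as UTF-16 when a
# UTF-16 BOM (little or big endian) is present, otherwise as UTF-8.
def guess_uni_encoding(data):
    if data.startswith(codecs.BOM_UTF16_LE) or \
       data.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'
    return 'utf-8'
```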

Jordan Justen (8):
BaseTools/Tests: Always add BaseTools source to import path
BaseTools/EdkLogger: Support unit tests with a SILENT log level
BaseTools/Tests: Add unit test for AutoGen.UniClassObject
BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni
files
BaseTools/Tests: Verify unsupported UTF-16 are rejected
BaseTools/UniClassObject: Support UTF-8 string data in .uni files
BaseTools/Tests: Verify 32-bit UTF-8 chars are rejected
OvmfPkg/PlatformDxe: Convert Platform.uni to UTF-8

BaseTools/Source/Python/AutoGen/UniClassObject.py | 87 ++++++++++++++-
BaseTools/Source/Python/Common/EdkLogger.py | 9 +-
BaseTools/Tests/CheckUnicodeSourceFiles.py | 128 ++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
BaseTools/Tests/RunTests.py | 2 -
BaseTools/Tests/TestTools.py | 9 +-
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
7 files changed, 232 insertions(+), 7 deletions(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:34 UTC
Permalink
This allows unit tests to import BaseTools python modules directly,
which is very useful when writing them.

Previously this was done when RunTests.py was executed, so unit tests
could import BaseTools modules only when they were run via RunTests.

This change also allows running the unit test files individually,
which can be faster when developing new unit test cases.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/RunTests.py | 2 --
BaseTools/Tests/TestTools.py | 9 ++++++++-
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/BaseTools/Tests/RunTests.py b/BaseTools/Tests/RunTests.py
index e8ca2d0..0dd6563 100644
--- a/BaseTools/Tests/RunTests.py
+++ b/BaseTools/Tests/RunTests.py
@@ -21,8 +21,6 @@ import unittest

import TestTools

-sys.path.append(TestTools.PythonSourceDir)
-
def GetCTestSuite():
import CToolsTests
return CToolsTests.TheTestSuite()
diff --git a/BaseTools/Tests/TestTools.py b/BaseTools/Tests/TestTools.py
index ac009db..27afd79 100644
--- a/BaseTools/Tests/TestTools.py
+++ b/BaseTools/Tests/TestTools.py
@@ -1,7 +1,7 @@
## @file
# Utility functions and classes for BaseTools unit tests
#
-# Copyright (c) 2008 - 2012, Intel Corporation. All rights reserved.<BR>
+# Copyright (c) 2008 - 2015, Intel Corporation. All rights reserved.<BR>
#
# This program and the accompanying materials
# are licensed and made available under the terms and conditions of the BSD License
@@ -31,6 +31,13 @@ CSourceDir = os.path.join(BaseToolsDir, 'Source', 'C')
PythonSourceDir = os.path.join(BaseToolsDir, 'Source', 'Python')
TestTempDir = os.path.join(TestsDir, 'TestTempDir')

+if PythonSourceDir not in sys.path:
+ #
+ # Allow unit tests to import BaseTools python modules. This is very useful
+ # for writing unit tests.
+ #
+ sys.path.append(PythonSourceDir)
+
def MakeTheTestSuite(localItems):
tests = []
for name, item in localItems.iteritems():
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:35 UTC
Permalink
This allows the unit tests to run without errors being logged to the
screen.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Source/Python/Common/EdkLogger.py | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/EdkLogger.py b/BaseTools/Source/Python/Common/EdkLogger.py
index f048b61..8d6d426 100644
--- a/BaseTools/Source/Python/Common/EdkLogger.py
+++ b/BaseTools/Source/Python/Common/EdkLogger.py
@@ -32,6 +32,7 @@ INFO = 20
WARN = 30
QUIET = 40
ERROR = 50
+SILENT = 99

IsRaiseError = True

@@ -39,7 +40,9 @@ IsRaiseError = True
_ToolName = os.path.basename(sys.argv[0])

# For validation purpose
-_LogLevels = [DEBUG_0, DEBUG_1, DEBUG_2, DEBUG_3, DEBUG_4, DEBUG_5, DEBUG_6, DEBUG_7, DEBUG_8, DEBUG_9, VERBOSE, WARN, INFO, ERROR, QUIET]
+_LogLevels = [DEBUG_0, DEBUG_1, DEBUG_2, DEBUG_3, DEBUG_4, DEBUG_5,
+ DEBUG_6, DEBUG_7, DEBUG_8, DEBUG_9, VERBOSE, WARN, INFO,
+ ERROR, QUIET, SILENT]

# For DEBUG level (All DEBUG_0~9 are applicable)
_DebugLogger = logging.getLogger("tool_debug")
@@ -235,6 +238,10 @@ def SetLevel(Level):
_InfoLogger.setLevel(Level)
_ErrorLogger.setLevel(Level)

+def InitializeForUnitTest():
+ Initialize()
+ SetLevel(SILENT)
+
## Get current log level
def GetLevel():
return _InfoLogger.getEffectiveLevel()
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:36 UTC
Permalink
This verifies that a .uni file containing UTF-16 data (with a BOM) is
successfully read.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 88 ++++++++++++++++++++++++++++++
BaseTools/Tests/PythonToolsTests.py | 4 +-
2 files changed, 91 insertions(+), 1 deletion(-)
create mode 100644 BaseTools/Tests/CheckUnicodeSourceFiles.py

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
new file mode 100644
index 0000000..0083ad8
--- /dev/null
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -0,0 +1,88 @@
+## @file
+# Unit tests for AutoGen.UniClassObject
+#
+# Copyright (c) 2015, Intel Corporation. All rights reserved.<BR>
+#
+# This program and the accompanying materials
+# are licensed and made available under the terms and conditions of the BSD License
+# which accompanies this distribution. The full text of the license may be found at
+# http://opensource.org/licenses/bsd-license.php
+#
+# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+# WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+#
+
+##
+# Import Modules
+#
+import os
+import unittest
+
+import codecs
+
+import TestTools
+
+from Common.Misc import PathClass
+import AutoGen.UniClassObject as BtUni
+
+from Common import EdkLogger
+EdkLogger.InitializeForUnitTest()
+
+class Tests(TestTools.BaseToolsTest):
+
+ SampleData = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "STR_A for en-US"
+ '''
+
+ def EncodeToFile(self, encoding, string=None):
+ if string is None:
+ string = self.SampleData
+ data = codecs.encode(string, encoding)
+ path = 'input.uni'
+ self.WriteTmpFile(path, data)
+ return PathClass(self.GetTmpFilePath(path))
+
+ def ErrorFailure(self, error, encoding, shouldPass):
+ msg = error + ' should '
+ if shouldPass:
+ msg += 'not '
+ msg += 'be generated for '
+ msg += '%s data in a .uni file' % encoding
+ self.fail(msg)
+
+ def UnicodeErrorFailure(self, encoding, shouldPass):
+ self.ErrorFailure('UnicodeError', encoding, shouldPass)
+
+ def EdkErrorFailure(self, encoding, shouldPass):
+ self.ErrorFailure('EdkLogger.FatalError', encoding, shouldPass)
+
+ def CheckFile(self, encoding, shouldPass, string=None):
+ path = self.EncodeToFile(encoding, string)
+ try:
+ BtUni.UniFileClassObject([path])
+ if shouldPass:
+ return
+ except UnicodeError:
+ if not shouldPass:
+ return
+ else:
+ self.UnicodeErrorFailure(encoding, shouldPass)
+ except EdkLogger.FatalError:
+ if not shouldPass:
+ return
+ else:
+ self.EdkErrorFailure(encoding, shouldPass)
+ except Exception:
+ pass
+
+ self.EdkErrorFailure(encoding, shouldPass)
+
+ def testUtf16InUniFile(self):
+ self.CheckFile('utf_16', shouldPass=True)
+
+TheTestSuite = TestTools.MakeTheTestSuite(locals())
+
+if __name__ == '__main__':
+ allTests = TheTestSuite()
+ unittest.TextTestRunner().run(allTests)
diff --git a/BaseTools/Tests/PythonToolsTests.py b/BaseTools/Tests/PythonToolsTests.py
index 6096e21..c953daf 100644
--- a/BaseTools/Tests/PythonToolsTests.py
+++ b/BaseTools/Tests/PythonToolsTests.py
@@ -1,7 +1,7 @@
## @file
# Unit tests for Python based BaseTools
#
-# Copyright (c) 2008, Intel Corporation. All rights reserved.<BR>
+# Copyright (c) 2008 - 2015, Intel Corporation. All rights reserved.<BR>
#
# This program and the accompanying materials
# are licensed and made available under the terms and conditions of the BSD License
@@ -24,6 +24,8 @@ def TheTestSuite():
suites = []
import CheckPythonSyntax
suites.append(CheckPythonSyntax.TheTestSuite())
+ import CheckUnicodeSourceFiles
+ suites.append(CheckUnicodeSourceFiles.TheTestSuite())
return unittest.TestSuite(suites)

if __name__ == '__main__':
--
2.1.4


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:37 UTC
Permalink
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This means that our current BaseTools code could be allowing
unsupported UTF-16 characters to be used. To fix this, we decode the
file using python's utf-16 decode support. Then we verify that each
character's code point is 0xffff or less.
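The validation step described here can be demonstrated in isolation. This is a sketch with an invented helper name, not the patch's real implementation (which wraps the check in a registered codec), and it assumes a Python build where a decoded Supplementary Plane character is represented as a single code point (Python 3, or a wide Python 2 build):

```python
import codecs

# Decode with the utf-16 codec, then verify every resulting code point
# fits in 16 bits. A Supplementary Plane character such as U+10300
# (encoded as the surrogate pair D800 DF00) must be rejected.
def all_chars_fit_ucs2(data, encoding='utf-16'):
    decoded = codecs.decode(data, encoding)
    return all(ord(ch) <= 0xffff for ch in decoded)
```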

v3: Based on Mike Kinney's feedback, we now read the whole file and
verify up-front that it contains valid UCS-2 characters. Thanks
also to Laszlo Ersek for pointing out the Supplementary Plane
characters.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@
import Common.LongFilePathOs as os, codecs, re
import distutils.util
import Common.EdkLogger as EdkLogger
+import StringIO
from Common.BuildToolError import *
from Common.String import GetLineNo
from Common.Misc import PathClass
@@ -147,6 +148,33 @@ def GetLanguageCode(LangName, IsCompatibleMode, File):

EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)

+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+class Ucs2Codec(codecs.Codec):
+ def __init__(self):
+ self.__utf16 = codecs.lookup('utf-16')
+
+ def encode(self, input, errors='strict'):
+ for Char in input:
+ if ord(Char) > 0xffff:
+ raise ValueError("Code Point too large to encode in UCS-2")
+ return self.__utf16.encode(input)
+
+TheUcs2Codec = Ucs2Codec()
+def Ucs2Search(name):
+ if name == 'ucs-2':
+ return codecs.CodecInfo(
+ name=name,
+ encode=TheUcs2Codec.encode,
+ decode=TheUcs2Codec.decode)
+ else:
+ return None
+codecs.register(Ucs2Search)
+
## StringDefClassObject
#
# A structure for language definition
@@ -209,7 +237,7 @@ class UniFileClassObject(object):
Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
if len(Lang) != 3:
try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
except:
@@ -253,6 +281,58 @@ class UniFileClassObject(object):
self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
return True

+ def OpenUniFile(self, FileName):
+ #
+ # Read file
+ #
+ try:
+ UniFile = open(FileName, mode='rb')
+ FileIn = UniFile.read()
+ UniFile.close()
+ except:
+ EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
+
+ #
+ # We currently only support UTF-16
+ #
+ Encoding = 'utf-16'
+
+ self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+ def VerifyUcs2Data(self, FileIn, FileName, Encoding):
+ Ucs2Info = codecs.lookup('ucs-2')
+ #
+ # Convert to unicode
+ #
+ try:
+ FileDecoded = codecs.decode(FileIn, Encoding)
+ Ucs2Info.encode(FileDecoded)
+ except:
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+ LineNumber = 0
+ ErrMsg = lambda Encoding, LineNumber: \
+ '%s contains invalid %s characters on line %d.' % \
+ (FileName, Encoding, LineNumber)
+ while True:
+ LineNumber = LineNumber + 1
+ try:
+ Line = File.readline()
+ if Line == '':
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg(Encoding, LineNumber))
+ Ucs2Info.encode(Line)
+ except:
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg('UCS-2', LineNumber))
+
#
# Get String name and value
#
@@ -305,7 +385,7 @@ class UniFileClassObject(object):
EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)

try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
except:
--
2.1.4


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-01 09:46:53 UTC
Permalink
Post by Jordan Justen
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'
Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
This means that our current BaseTools code could be allowing
unsupported UTF-16 characters be used. To fix this, we decode the file
using python's utf-16 decode support. Then we verify that each
character's code point is 0xffff or less.
v3: Based on Mike Kinney's feedback, we now read the whole file and
verify up-front that it contains valid UCS-2 characters. Thanks
also to Laszlo Ersek for pointing out the Supplementary Plane
characters.
Contributed-under: TianoCore Contribution Agreement 1.0
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)
diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@
import Common.LongFilePathOs as os, codecs, re
import distutils.util
import Common.EdkLogger as EdkLogger
+import StringIO
from Common.BuildToolError import *
from Common.String import GetLineNo
from Common.Misc import PathClass
EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)
+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+ self.__utf16 = codecs.lookup('utf-16')
+
+ raise ValueError("Code Point too large to encode in UCS-2")
+ return self.__utf16.encode(input)
I could be missing some context, but the wikipedia article referenced at
the top says that

The official Unicode standard says that no UTF forms [...] can encode
[U+D800 to U+DFFF] code points.

However UCS-2, UTF-8 [...] can encode [U+D800 to U+DFFF] code points
in trivial and obvious ways [...]

So basically it is possible to construct a UTF-8 file that (albeit
violating the unicode standard) will result in U+D800 to U+DFFF code
*points* when parsed. Then (I think) these would be turned into UCS2
code *units* with identical numeric values. Which, I think, would be wrong.

Also, the same seems to apply to a UTF-16 source:

It is possible to unambiguously encode them in UTF-16 by using a code
unit equal to the code point, as long as no sequence of two code
units can be interpreted as a legal surrogate pair (that is, as long
as a high surrogate is never followed by a low surrogate). The
majority of UTF-16 encoder and decoder implementations translate
between encodings as though this were the case.

Shouldn't we accept code points *only* in "U+0000 to U+D7FF and U+E000
to U+FFFF"? The article says

Both UTF-16 and UCS-2 encode code points in this range as single
16-bit code units that are numerically equal to the corresponding
code points. These code points in the BMP are the /only/ code points
that can be represented in UCS-2. Modern text almost exclusively
consists of these code points.

For example,

printf '%b' '\x01\xD8\x24\x00'

creates two UTF-16 code units (in LE representation):

D801 0024

This is not a valid surrogate pair (there is no low surrogate). Does
your code (or the underlying python machinery) reject the output of the
above printf invocation?

For example, "iconv" rejects it:

$ printf '%b' '\x01\xD8\x24\x00' | iconv -f UTF-16LE -t UCS-2
iconv: illegal input sequence at position 0
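For comparison, a modern Python interpreter's built-in UTF-16 decoder also rejects these bytes. Note this checks Python 3 behavior; the BaseTools code in this thread ran under Python 2, where the decoder's handling of lone surrogates may differ:

```python
# Laszlo's example bytes: UTF-16LE code units D801 0024 -- a high
# surrogate not followed by a low surrogate, i.e. not a legal pair.
data = b'\x01\xd8\x24\x00'
try:
    data.decode('utf-16-le')
    rejected = False
except UnicodeDecodeError:
    rejected = True
```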
Post by Jordan Justen
+
+TheUcs2Codec = Ucs2Codec()
+ return codecs.CodecInfo(
+ name=name,
+ encode=TheUcs2Codec.encode,
+ decode=TheUcs2Codec.decode)
+ return None
+codecs.register(Ucs2Search)
+
## StringDefClassObject
#
# A structure for language definition
Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
return True
+ #
+ # Read file
+ #
+ UniFile = open(FileName, mode='rb')
+ FileIn = UniFile.read()
+ UniFile.close()
+ EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
+
+ #
+ # We currently only support UTF-16
+ #
+ Encoding = 'utf-16'
+
+ self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+ Ucs2Info = codecs.lookup('ucs-2')
+ #
+ # Convert to unicode
+ #
+ FileDecoded = codecs.decode(FileIn, Encoding)
If this method call is guaranteed to throw an exception when it
encounters a UTF-16LE (or UTF-8) sequence that would decode to a
surrogate code point (ie. in the U+D800 to U+DFFF range), then we should
be fine.

Otherwise, such code points should be caught in Ucs2Codec.encode().

Thanks
Laszlo
Post by Jordan Justen
+ Ucs2Info.encode(FileDecoded)
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+ LineNumber = 0
+ ErrMsg = lambda Encoding, LineNumber: \
+ '%s contains invalid %s characters on line %d.' % \
+ (FileName, Encoding, LineNumber)
+ LineNumber = LineNumber + 1
+ Line = File.readline()
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg(Encoding, LineNumber))
+ Ucs2Info.encode(Line)
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg('UCS-2', LineNumber))
+
#
# Get String name and value
#
EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
------------------------------------------------------------------------------
Jordan Justen
2015-06-01 23:09:40 UTC
Permalink
Post by Laszlo Ersek
Post by Jordan Justen
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
Post by Jordan Justen
import codecs
codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'
Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
This means that our current BaseTools code could be allowing
unsupported UTF-16 characters be used. To fix this, we decode the file
using python's utf-16 decode support. Then we verify that each
character's code point is 0xffff or less.
v3: Based on Mike Kinney's feedback, we now read the whole file and
verify up-front that it contains valid UCS-2 characters. Thanks
also to Laszlo Ersek for pointing out the Supplementary Plane
characters.
Contributed-under: TianoCore Contribution Agreement 1.0
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)
diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@
import Common.LongFilePathOs as os, codecs, re
import distutils.util
import Common.EdkLogger as EdkLogger
+import StringIO
from Common.BuildToolError import *
from Common.String import GetLineNo
from Common.Misc import PathClass
EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)
+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+ self.__utf16 = codecs.lookup('utf-16')
+
+ raise ValueError("Code Point too large to encode in UCS-2")
+ return self.__utf16.encode(input)
I could be missing some context, but the wikipedia article referenced at
the top says that
The official Unicode standard says that no UTF forms [...] can encode
[U+D800 to U+DFFF] code points.
However UCS-2, UTF-8 [...] can encode [U+D800 to U+DFFF] code points
in trivial and obvious ways [...]
So basically it is possible to construct a UTF-8 file that (albeit
violating the unicode standard) will result in U+D800 to U+DFFF code
*points* when parsed. Then (I think) these would be turned into UCS2
code *units* with identical numeric values. Which, I think, would be wrong.
It is possible to unambiguously encode them in UTF-16 by using a code
unit equal to the code point, as long as no sequence of two code
units can be interpreted as a legal surrogate pair (that is, as long
as a high surrogate is never followed by a low surrogate). The
majority of UTF-16 encoder and decoder implementations translate
between encodings as though this were the case.
Shouldn't we accept code points *only* in "U+0000 to U+D7FF and U+E000
to U+FFFF"? The article says
I think you are right. Under "U+D800 to U+DFFF" it says: "The Unicode
standard permanently reserves these code point values for UTF-16
encoding of the high and low surrogates, and they will never be
assigned a character, so there should be no reason to encode them."

So, I guess there is no valid reason to accept them as input code
points.

-Jordan
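The stricter rule agreed on here, rejecting both Supplementary Plane characters and the surrogate range, could look like the sketch below. The function name is invented for illustration and this is not the code that eventually landed:

```python
# Valid UCS-2 code points per the discussion above: U+0000..U+D7FF and
# U+E000..U+FFFF. The surrogate range U+D800..U+DFFF is reserved for
# UTF-16 encoding and is never a valid input code point.
def is_valid_ucs2_code_point(cp):
    return 0 <= cp <= 0xffff and not (0xd800 <= cp <= 0xdfff)
```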
------------------------------------------------------------------------------
Kinney, Michael D
2015-06-08 03:44:16 UTC
Permalink
Jordan,

The functionality of the patch set looks good. Just a few comments below about places that need better code comments.

Reviewed-by: Michael Kinney <***@intel.com>

We also discussed the need for an extra tool that can scan a workspace or a package for UNI files and convert them to UTF-16LE. This tool may be required when releases are generated to guarantee compatibility with other tools that operate on UNI files that require UNI files to be UTF-16LE. Do you think this helper tool should be implemented in BaseTools? We may want the same helper tool to convert to between any of the supported encodings so it can be used to convert current UTF-16LE to UTF-8 too.
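The helper tool described above could be sketched roughly as follows. This is hypothetical: no such tool exists in the series, and the name and interface are invented for illustration:

```python
# Hypothetical re-encoding helper: read a .uni file in one supported
# encoding and write it back out in another (e.g. UTF-8 -> UTF-16LE
# with BOM, as might be required when generating releases).
def convert_uni_file(src_path, dst_path, src_enc='utf-8', dst_enc='utf-16'):
    with open(src_path, 'rb') as f:
        text = f.read().decode(src_enc)
    with open(dst_path, 'wb') as f:
        f.write(text.encode(dst_enc))
```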

Thanks,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Monday, June 01, 2015 12:32 AM
To: edk2-***@lists.sourceforge.net
Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D; Laszlo Ersek
Subject: [PATCH v3 4/8] BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files

Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This means that our current BaseTools code could be allowing
unsupported UTF-16 characters to be used. To fix this, we decode the file
using python's utf-16 decode support. Then we verify that each
character's code point is 0xffff or less.

v3: Based on Mike Kinney's feedback, we now read the whole file and
verify up-front that it contains valid UCS-2 characters. Thanks
also to Laszlo Ersek for pointing out the Supplementary Plane
characters.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@
import Common.LongFilePathOs as os, codecs, re
import distutils.util
import Common.EdkLogger as EdkLogger
+import StringIO
from Common.BuildToolError import *
from Common.String import GetLineNo
from Common.Misc import PathClass
@@ -147,6 +148,33 @@ def GetLanguageCode(LangName, IsCompatibleMode, File):

EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)

+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+class Ucs2Codec(codecs.Codec):
+ def __init__(self):
+ self.__utf16 = codecs.lookup('utf-16')
+
+ def encode(self, input, errors='strict'):
+ for Char in input:
+ if ord(Char) > 0xffff:
+ raise ValueError("Code Point too large to encode in UCS-2")
+ return self.__utf16.encode(input)
+
+TheUcs2Codec = Ucs2Codec()

This is creating a global object in this module for the UCS-2 codec. Needs a comment to describe this.

+def Ucs2Search(name):
+ if name == 'ucs-2':
+ return codecs.CodecInfo(
+ name=name,
+ encode=TheUcs2Codec.encode,
+ decode=TheUcs2Codec.decode)
+ else:
+ return None
+codecs.register(Ucs2Search)

This is registering the new UCS-2 codec with the codecs module when this module is imported. Needs a comment to describe this.

+
## StringDefClassObject
#
# A structure for language definition
@@ -209,7 +237,7 @@ class UniFileClassObject(object):
Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
if len(Lang) != 3:
try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
except:
@@ -253,6 +281,58 @@ class UniFileClassObject(object):
self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
return True

+ def OpenUniFile(self, FileName):
+ #
+ # Read file
+ #
+ try:
+ UniFile = open(FileName, mode='rb')
+ FileIn = UniFile.read()
+ UniFile.close()
+ except:
+ EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=FileName)
+
+ #
+ # We currently only support UTF-16
+ #
+ Encoding = 'utf-16'

This comment is confusing because the patch allows UNI files to be in different encodings. Is this really referring to the encoding used internally to manage content read from an external file?

+
+ self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+ def VerifyUcs2Data(self, FileIn, FileName, Encoding):
+ Ucs2Info = codecs.lookup('ucs-2')
+ #
+ # Convert to unicode
+ #
+ try:
+ FileDecoded = codecs.decode(FileIn, Encoding)
+ Ucs2Info.encode(FileDecoded)
+ except:
+ UniFile = StringIO.StringIO(FileIn)
+ Info = codecs.lookup(Encoding)
+ (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+ File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+ LineNumber = 0
+ ErrMsg = lambda Encoding, LineNumber: \
+ '%s contains invalid %s characters on line %d.' % \
+ (FileName, Encoding, LineNumber)
+ while True:
+ LineNumber = LineNumber + 1
+ try:
+ Line = File.readline()
+ if Line == '':
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg(Encoding, LineNumber))
+ Ucs2Info.encode(Line)
+ except:
+ EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+ ErrMsg('UCS-2', LineNumber))
+
#
# Get String name and value
#
@@ -305,7 +385,7 @@ class UniFileClassObject(object):
EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)

try:
- FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+ FileIn = self.OpenUniFile(LongFilePath(File.Path))
except UnicodeError, X:
EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
except:
--
2.1.4
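The codec-registration pattern in this patch can be sketched as a self-contained Python 3 program. This is an illustration, not the patch itself: the patch targets Python 2, and the `('ucs-2', 'ucs_2')` check is an assumption to cover Python versions that normalize hyphens in codec names during lookup.

```python
import codecs

class Ucs2Codec(codecs.Codec):
    """Encode-only codec: UTF-16 encoding restricted to the UCS-2 range."""
    def __init__(self):
        self.__utf16 = codecs.lookup('utf-16')

    def encode(self, input, errors='strict'):
        # Reject anything outside the Basic Multilingual Plane before
        # delegating to the real UTF-16 codec.
        for ch in input:
            if ord(ch) > 0xffff:
                raise ValueError("Code Point too large to encode in UCS-2")
        return self.__utf16.encode(input, errors)

_the_codec = Ucs2Codec()

def ucs2_search(name):
    # Depending on the Python version, codecs.lookup() may hand the
    # search function 'ucs-2' as-is or the normalized 'ucs_2'.
    if name in ('ucs-2', 'ucs_2'):
        return codecs.CodecInfo(name=name,
                                encode=_the_codec.encode,
                                decode=_the_codec.decode)
    return None

codecs.register(ucs2_search)

ucs2 = codecs.lookup('ucs-2')
data, _ = ucs2.encode('BMP-only text')   # fine: all code points <= 0xFFFF
try:
    ucs2.encode('\U00010300')            # supplementary plane: rejected
    rejected = False
except ValueError:
    rejected = True
```

Once registered, any later `codecs.lookup('ucs-2')` in the process resolves to this encode-only codec, which is how `VerifyUcs2Data` consumes it.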


------------------------------------------------------------------------------
Jordan Justen
2015-06-08 06:13:22 UTC
Post by Jordan Justen
+
+TheUcs2Codec = Ucs2Codec()
This is creating a global object in this module for the UCS-2 codec.
Needs a comment to describe this.
Ok.

How about:

## Instance of Ucs2Codec class
#
# This object is used to support a codec for UCS-2 encoding
#
# The Ucs2Search function uses this object when a search for the ucs-2
# codec is requested.
#
Post by Jordan Justen
+ #
+ # We currently only support UTF-16
+ #
+ Encoding = 'utf-16'
This comment is confusing because the patch allows UNI files to be
in different encodings. Is this really referring to the encoding
used internally to manage content read from an external file?
In two patches later, "BaseTools/UniClassObject: Support UTF-8 string
data in .uni files" (patch 6/8 for v3 and 06/10 for v4), I update this
comment to:

"Detect Byte Order Mark at beginning of file. Default to UTF-8"

That is the patch where we start to accept more than just UTF-16 input
files.

With the comment above added, and this explanation, do you think the
patch is good enough to add your Reviewed-by signature? How about the
other patches in the series?

Thanks for your time,

-Jordan

------------------------------------------------------------------------------
Kinney, Michael D
2015-06-08 15:01:42 UTC
Jordan,

Yes. With that one comment added, the entire series looks good to me.

Reviewed-by: Michael Kinney <***@intel.com>

Mike

------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:38 UTC
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
Post by Jordan Justen
import codecs
codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This test checks to make sure that BaseTools will reject these
characters in UTF-16 files.

This test is made to pass by the previous commit:
"BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files"

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
Cc: Laszlo Ersek <***@redhat.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 0083ad8..39fd2fe 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -81,6 +81,21 @@ class Tests(TestTools.BaseToolsTest):
def testUtf16InUniFile(self):
self.CheckFile('utf_16', shouldPass=True)

+ def testSupplementaryPlaneUnicodeCharInUtf16File(self):
+ #
+ # Supplementary Plane characters can exist in UTF-16 files,
+ # but they are not valid UCS-2 characters.
+ #
+ # This test makes sure that BaseTools rejects these characters
+ # if seen in a .uni file.
+ #
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_16', shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())

if __name__ == '__main__':
--
2.1.4
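The surrogate-pair arithmetic described in the commit message above can be checked directly (Python 3 syntax; the interpreter session quoted in the message is Python 2):

```python
# A supplementary-plane character needs a UTF-16 surrogate pair,
# so it cannot be represented as a single UCS-2 code unit.
ch = '\U00010300'                      # U+10300, outside the BMP
encoded = ch.encode('utf-16-le')       # little-endian, no BOM
assert encoded == b'\x00\xd8\x00\xdf'  # surrogate pair 0xd800, 0xdf00
assert len(encoded) == 4               # two 16-bit code units
assert ord(ch) > 0xffff                # outside the UCS-2 range
```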


------------------------------------------------------------------------------
Laszlo Ersek
2015-06-01 09:48:43 UTC
Post by Jordan Justen
Supplementary Plane characters can exist in UTF-16 files,
but they are not valid UCS-2 characters.
Post by Jordan Justen
import codecs
codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'
Therefore the UCS-4 0x00010300 character is encoded as two
16-bit numbers (0xd800 0xdf00) in a little endian UTF-16
file.
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF
This test checks to make sure that BaseTools will reject these
characters in UTF-16 files.
"BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files"
Contributed-under: TianoCore Contribution Agreement 1.0
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 0083ad8..39fd2fe 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
self.CheckFile('utf_16', shouldPass=True)
+ #
+ # Supplementary Plane characters can exist in UTF-16 files,
+ # but they are not valid UCS-2 characters.
+ #
+ # This test makes sure that BaseTools rejects these characters
+ # if seen in a .uni file.
+ #
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_16', shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())
I'd propose to extend this with a test case that feeds binary data (not
unicode text) to the checker, and the data should look similar to the
printf example in my previous comment.

Thanks
Laszlo

------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:40 UTC
Since UTF-8 .uni unicode files might contain strings with unicode code
points larger than 16 bits, and UEFI only supports UCS-2 characters,
we need to make sure that BaseTools rejects these characters in UTF-8
.uni source files.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
---
BaseTools/Tests/CheckUnicodeSourceFiles.py | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/BaseTools/Tests/CheckUnicodeSourceFiles.py b/BaseTools/Tests/CheckUnicodeSourceFiles.py
index 39fd2fe..1b17377 100644
--- a/BaseTools/Tests/CheckUnicodeSourceFiles.py
+++ b/BaseTools/Tests/CheckUnicodeSourceFiles.py
@@ -96,6 +96,23 @@ class Tests(TestTools.BaseToolsTest):

self.CheckFile('utf_16', shouldPass=False, string=data)

+ def test32bitUnicodeCharInUtf8File(self):
+ data = u'''
+ #langdef en-US "English"
+ #string STR_A #language en-US "CodePoint (\U00010300) > 0xFFFF"
+ '''
+
+ self.CheckFile('utf_8', shouldPass=False, string=data)
+
+ def test32bitUnicodeCharInUtf8Comment(self):
+ data = u'''
+ // Even in comments, we reject non-UCS-2 chars: \U00010300
+ #langdef en-US "English"
+ #string STR_A #language en-US "A"
+ '''
+
+ self.CheckFile('utf_8', shouldPass=False, string=data)
+
TheTestSuite = TestTools.MakeTheTestSuite(locals())

if __name__ == '__main__':
--
2.1.4
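A sketch of the whole-file UCS-2 range check these tests exercise, covering comment lines as well as string data. `find_non_ucs2` is a hypothetical helper for illustration, not code from the series:

```python
def find_non_ucs2(text):
    """Return (line, column) of the first code point above 0xFFFF, or None."""
    line, col = 1, 1
    for ch in text:
        if ord(ch) > 0xffff:
            return (line, col)
        if ch == '\n':
            line, col = line + 1, 1
        else:
            col += 1
    return None

# Non-UCS-2 characters are rejected even inside comments.
source = '// comment with \U00010300\n#langdef en-US "English"\n'
assert find_non_ucs2(source) == (1, 17)
assert find_non_ucs2('#string STR_A #language en-US "A"\n') is None
```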


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:39 UTC
This allows .uni input files to be encoded with UTF-8. Today, we only
support UTF-16 encoding.

The strings are still converted to UCS-2 data for use in EDK II
modules. (This is the only unicode character format supported by UEFI
and EDK II.)

Although UTF-8 would allow any UCS-4 character to be present in the
source file, we restrict the entire file to the UCS-2 range.
(Including comments.) This allows the files to be converted to UTF-16
if needed.

v2:
* Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
* Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
commit

v3:
* Restrict the entire file's characters (including comments) to the
UCS-2 range in addition to string data. (mdkinney)

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Cc: Yingke D Liu <***@intel.com>
Cc: Michael D Kinney <***@intel.com>
---
BaseTools/Source/Python/AutoGen/UniClassObject.py | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index 66fdbf0..6ccaef0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -293,9 +293,12 @@ class UniFileClassObject(object):
EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=FileName)

#
- # We currently only support UTF-16
+ # Detect Byte Order Mark at beginning of file. Default to UTF-8
#
- Encoding = 'utf-16'
+ Encoding = 'utf-8'
+ if (FileIn.startswith(codecs.BOM_UTF16_BE) or
+ FileIn.startswith(codecs.BOM_UTF16_LE)):
+ Encoding = 'utf-16'

self.VerifyUcs2Data(FileIn, FileName, Encoding)
--
2.1.4
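The BOM-sniffing rule in this patch, restated as a standalone Python 3 sketch (`detect_uni_encoding` is an illustrative name, not from the patch):

```python
import codecs

def detect_uni_encoding(data: bytes) -> str:
    # Mirror the patch: a UTF-16 BOM selects UTF-16; anything else is
    # treated as UTF-8. (As in the patch, a UTF-32 LE BOM, which also
    # begins with b'\xff\xfe', is not distinguished here.)
    if (data.startswith(codecs.BOM_UTF16_BE) or
        data.startswith(codecs.BOM_UTF16_LE)):
        return 'utf-16'
    return 'utf-8'

assert detect_uni_encoding(
    codecs.BOM_UTF16_LE + 'abc'.encode('utf-16-le')) == 'utf-16'
assert detect_uni_encoding(b'#langdef en-US "English"') == 'utf-8'
```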


------------------------------------------------------------------------------
Jordan Justen
2015-06-01 07:31:41 UTC
This command was used to convert the file:
iconv -f UTF-16 -t UTF-8 \
-o OvmfPkg/PlatformDxe/Platform.uni \
OvmfPkg/PlatformDxe/Platform.uni

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <***@intel.com>
Reviewed-by: Laszlo Ersek <***@redhat.com>
---
OvmfPkg/PlatformDxe/Platform.uni | Bin 3298 -> 1648 bytes
1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/OvmfPkg/PlatformDxe/Platform.uni b/OvmfPkg/PlatformDxe/Platform.uni
index d8d5b0bb4de9c0632dc183369fa7d46f455e9143..6df865519f35e302430f1ad290729011fd4c9048 100644
GIT binary patch
literal 1648
[base85 patch data corrupted by the list archive's address munging; omitted]

literal 3298
[base85 patch data corrupted by the list archive's address munging; omitted]
--
2.1.4


------------------------------------------------------------------------------