blob: b9bb51e021762e75ce0216bb05e66344014a505d [file] [log] [blame]
This directory contains the source code of ICU 4.2.1 for C/C++
1. It was obtained with the following:
$ svn export --native-eol LF http://source.icu-project.org/repos/icu/icu/tags/release-4-2-1 icu42
2. The following directories were removed because they're not used by Chromium
at the moment:
as_is
packaging
source/extra
source/sample
source/layout
source/layoutex
3. Platform header files for Linux and Mac OS X:
On Linux and Mac OS X, 'runConfigureICU Linux' and 'runConfigureICU MacOSX'
are run to generate source/common/unicode/platform.h.
Rename it to 'plinux.h' and 'pmac.h' on Linux and Mac and check them in.
The Mac 'pmac.h' file needs to have patches/pmac.h.patch applied.
Change source/common/unicode/umachine.h to refer to plinux.h and pmac.h
on Linux and Mac, respetively.
4. To avoid name collisions (two different versions of StringPiece
are in Chrome's base and ICU), make the use of 'icu::' namespace
qualifier required by setting U_USING_ICU_NAMESPACE to 0 in
source/common/unicode/uversion.h
In addition, the patches for ICU ticket 6935
(http://icu-project.org/trac/ticket/6935) are applied.
The combined patch is patches/namespace.patch.txt
5. The word breaking for Chinese and Japanese were modified to use a word
frequency list with the following patch and cjdict.txt.
In addition, the word breaking rule for ASCII and full-width full stop(period)
surrounded by letters has been modified to fit our need for segmenting
a host name into its components (e.g. treating 'www.google.com' not as
a single word but as 5 words). It's what ICU 3.8 did before UTR 29
changed the rule (WB #6, #7). This also let us pass
LayoutTests/css1/text_properties/text_transform.html without rebaselining.
These patches alone will not work without build-related changes mentioned
in #10 below.
- patches/segmentation.patch.txt :
Adds a dictionary (word-frequency)-based word breaking for CJK
(Korean is supported in the code, but it does not do anything
because we don't have a Korean word-list.)
- source/data/brkitr/cjdict.txt :
Chinese and Japanese word frequency list.
See the file for license/copyright notice
- source/data/brkitr/cc_edict.txt :
the list of words derived from CC-Edict.)
The following two files were removed (because Japanese breaking rules
are now the same as that of other langauges).
- source/data/brkitr/word_ja.txt
- source/data/brkitr/ja.txt
If you want to run ICU tests, you have to copy source/data/brkitr/cjdict.txt
to source/test/testdata/cjdict-truncated.txt to pass TestTrieWithValue test.
6. A minor break iterator change
- patches/brkitr.patch.txt
7. Converter changes : converters.patch.txt
- Include what we really need. See source/data/mappings/ucmlocal.txt
- Alias and mapping changes : source/data/mappings/convrtrs.txt
- Changes several tables and add six new tables, three of which
are 'fake' tables for ISO-2022-CN(-Ext).
- ucnv2022.c is modified to use 3 'fake' tables added above for
ISO-2022-CN(-Ext).
8. Locale changes
- patches/locale1.patch.txt :
Filipino locale, exemplar character set changes for CJK + 9 Indian
locales with minor fixes for Danish, Hungarian, Turkish, Korean
and Catalan.
- patches/locale2.patch.txt :
The minimum locale data Chrome needs for 35 languages Chrome is
not localized to. Each locale data file has ExemplarCharacters,
LocaleScript, layout, and the name of the language for a locale
in its native language.
- patches/locale3.patch.txt : Locale build configuration files
9. Removal of unihan collation tables from data/coll/{zh,ja,ko}.txt
- patches/unihan.patch.txt:
unihan collation tables are never used in Chrome/Webkit, but it takes
about 1MB in the uncompressed ICU data file in ICU 4.2.1.
10. Build-related changes
- patches/wpo.patch
- patches/windows.patch
- patches/data.build.patch :
To remove some data files we don't use and cut down the data size.
- patches/data.build.win.patch :
Windows-only data build patch. Add a new target DATALIB to makedata.mak
- add an empty file (stubdatabuilt.txt) to source/stubdata
11. Pre-built data libraries are checked in.
- source/data/in/icudt42l.dat : Built on Linux with all the patches
above applied,
- icudt42.dll : With icudt42l.dat in place, all the patches applied
and header files moved (#11 below), generated in bin/ by building
icudt_build project of build/icudt_build.sln on Windows.
It's made in bin/ and moved to the top and checked in.
- {mac,linux}/icudt42l_dat.s : Built on Mac and Linux with all the
patches above applied and checked in.
linux needs the '@' in the preamble changed to '%'. See
http://codereview.chromium.org/215026.
mac/icudt42l_dat.s needs one line added after it is generated. A
.private_extern directive needs to be added so that the top of the
file looks like:
.globl _icudt42_dat
.private_extern _icudt42_dat
.data
12. The header files were moved as shown below:
source/common/unicode ==> public/common/unicode
source/i18n/unicode ==> public/i18n/unicode
13. The patch for a memory leak in i18n/timezone.cpp (Windows only):
see http://bugs.icu-project.org/trac/ticket/7135
- patches/tzmemory.patch
14. The patch for a crash in common/putil.c (Linux only):
see http://bugs.icu-project.org/trac/ticket/7177
- patches/linuxtz.patch
15. The patch for Linux locale detection
- patches/locdet.patch