"No Unicode mapping" warning when extracting text from a PDF document with PDFBox because the font dictionary has no ToUnicode CMap entry

Question

The Adobe Acrobat Pro "Content View" displays the characters normally, but when I copy and paste them, they are invalid. With "copy with formatting", however, they come out normal (bad case image).

For example, the first character is "重" (bad case PDF file). When I use PDFBox to extract the text, it logs warnings like this:

Jan 08, 2021 11:14:37 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+18429 (18429) in font GVAQVQ+SimSun

In PDFont.loadUnicodeCmap(), there is no ToUnicode CMap entry in font GVAQVQ+SimSun, so PDType0Font.toUnicodeCMap is null. As a result, a call to PDFont.toUnicode() returns null.

@mkl Is there some way to solve this problem? Thanks in advance.

PDType0Font/null, PostScript name: GVAQVQ+SimSun

   0 = {SmallMap$SmallMapEntry@2123} "COSName{BaseFont}" -> "COSName{GVAQVQ+SimSun}"
   1 = {SmallMap$SmallMapEntry@2124} "COSName{DescendantFonts}" -> "COSArray{[COSDictionary{COSName{BaseFont}:COSName{GVAQVQ+SimSun};COSName{CIDSystemInfo}:COSDictionary{COSName{Ordering}:COSString{Identity};COSName{Registry}:COSString{PDFXC30};COSName{Supplement}:COSInt{0};};COSName{DW}:COSInt{1000};COSName{FontDescriptor}:COSObject{COSDictionary{COSName{Ascent}:COSInt{859};COSName{AvgWidth}:COSInt{500};COSName{CapHeight}:COSInt{668};COSName{Descent}:COSInt{-141};COSName{Flags}:COSInt{32};COSName{FontBBox}:COSArray{COSInt{-8};COSInt{-145};1000;859;};COSName{FontFile2}:COSObject{COSDictionary{COSName{Length}:COSInt{175201};COSName{Filter}:COSArray{COSName{FlateDecode};};COSName{Length1}:COSInt{468544};}COSStream{-708342007}};COSName{FontName}:-120083354;COSName{ItalicAngle}:0;COSName{Leading}:COSInt{141};COSName{MaxWidth}:1000;COSName{MissingWidth}:500;COSName{StemH}:COSInt{70};COSName{StemV}:70;COSName{Type}:COSName{FontDescriptor};COSName{XHeight}:COSInt{438};}};COSName{Subtype}:COSName{CIDFontType2};COSName{Type}:COSNa
   2 = {SmallMap$SmallMapEntry@2125} "COSName{Encoding}" -> "COSName{Identity-H}"
   3 = {SmallMap$SmallMapEntry@2126} "COSName{Subtype}" -> "COSName{Type0}"
   4 = {SmallMap$SmallMapEntry@2127} "COSName{Type}" -> "COSName{Font}"

"COSName{FontDescriptor}" -> "COSObject{15, 0}"
key = {COSName@2168} "COSName{FontDescriptor}"
value = {COSObject@2169} "COSObject{15, 0}"
baseObject = {COSDictionary@2209} "COSDictionary{COSName{Ascent}:COSInt{859};COSName{AvgWidth}:COSInt{500};COSName{CapHeight}:COSInt{668};COSName{Descent}:COSInt{-141};COSName{Flags}:COSInt{32};COSName{FontBBox}:COSArray{COSInt{-8};COSInt{-145};COSInt{1000};859;};COSName{FontFile2}:COSObject{COSDictionary{COSName{Length}:COSInt{175201};COSName{Filter}:COSArray{COSName{FlateDecode};};COSName{Length1}:COSInt{468544};}COSStream{-708342007}};COSName{FontName}:COSName{GVAQVQ+SimSun};COSName{ItalicAngle}:COSInt{0};COSName{Leading}:COSInt{141};COSName{MaxWidth}:1000;COSName{MissingWidth}:500;COSName{StemH}:COSInt{70};COSName{StemV}:70;COSName{Type}:COSName{FontDescriptor};COSName{XHeight}:COSInt{438};}"
needToBeUpdated = false
items = {SmallMap@2211}  size = 16
0 = {SmallMap$SmallMapEntry@2214} "COSName{Ascent}" -> "COSInt{859}"
1 = {SmallMap$SmallMapEntry@2215} "COSName{AvgWidth}" -> "COSInt{500}"
2 = {SmallMap$SmallMapEntry@2216} "COSName{CapHeight}" -> "COSInt{668}"
3 = {SmallMap$SmallMapEntry@2217} "COSName{Descent}" -> "COSInt{-141}"
4 = {SmallMap$SmallMapEntry@2218} "COSName{Flags}" -> "COSInt{32}"
5 = {SmallMap$SmallMapEntry@2219} "COSName{FontBBox}" -> "COSArray{[COSInt{-8}, COSInt{-145}, COSInt{1000}, COSInt{859}]}"
6 = {SmallMap$SmallMapEntry@2220} "COSName{FontFile2}" -> "COSObject{12, 0}"
7 = {SmallMap$SmallMapEntry@2221} "COSName{FontName}" -> "COSName{GVAQVQ+SimSun}"
8 = {SmallMap$SmallMapEntry@2222} "COSName{ItalicAngle}" -> "COSInt{0}"
9 = {SmallMap$SmallMapEntry@2223} "COSName{Leading}" -> "COSInt{141}"
10 = {SmallMap$SmallMapEntry@2224} "COSName{MaxWidth}" -> "COSInt{1000}"
11 = {SmallMap$SmallMapEntry@2225} "COSName{MissingWidth}" -> "COSInt{500}"
12 = {SmallMap$SmallMapEntry@2226} "COSName{StemH}" -> "COSInt{70}"
13 = {SmallMap$SmallMapEntry@2227} "COSName{StemV}" -> "COSInt{70}"
14 = {SmallMap$SmallMapEntry@2228} "COSName{Type}" -> "COSName{FontDescriptor}"
15 = {SmallMap$SmallMapEntry@2229} "COSName{XHeight}" -> "COSInt{438}"

mkl · Accepted Answer · 2021-01-08T10:15:56

PDFBox text extraction works according to the algorithm presented in section 9.10.2 "Mapping Character Codes to Unicode Values" of the PDF specification ISO 32000-1. Applied to your file, this algorithm fails to extract the text drawn with the embedded SimSun font subset (F2):

- "If the font dictionary contains a ToUnicode CMap" - F2 does not have a ToUnicode CMap.
- "If the font is a simple font" - F2 is not a simple font.
- "If the font is a composite font" - F2 indeed is a composite font, but ...
  - "that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V)" - F2 uses Identity-H.
  - "or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection" - F2 uses the PDFXC30-Identity character collection.
- "If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing."
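
The checks above can be written down as a small decision procedure. The following is only a minimal sketch in plain Java, not PDFBox code: the class UnicodeMappingCheck, the FontInfo holder, and the returned labels are hypothetical names introduced for illustration. It also simplifies the specification in two places: any simple font is treated as mappable, and any named encoding other than Identity-H or Identity-V is assumed to be one of the predefined CMaps from Table 118.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the lookup order in ISO 32000-1, section 9.10.2. */
final class UnicodeMappingCheck {

    /** Hypothetical summary of the font properties the algorithm looks at. */
    static final class FontInfo {
        final boolean hasToUnicode;        // font dictionary has a ToUnicode entry
        final boolean composite;           // Type0 font (true) or simple font (false)
        final String encoding;             // e.g. "Identity-H"
        final String characterCollection;  // Registry-Ordering, e.g. "Adobe-GB1"

        FontInfo(boolean hasToUnicode, boolean composite,
                 String encoding, String characterCollection) {
            this.hasToUnicode = hasToUnicode;
            this.composite = composite;
            this.encoding = encoding;
            this.characterCollection = characterCollection;
        }
    }

    // Character collections named in the composite-font step of the algorithm.
    private static final List<String> KNOWN_COLLECTIONS = Arrays.asList(
            "Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1");

    /** Returns a label for the step that could supply Unicode values, or "none". */
    static String mappingSource(FontInfo f) {
        if (f.hasToUnicode) {
            return "ToUnicode CMap";
        }
        if (!f.composite) {
            // Simplification: the specification adds conditions on the encoding here.
            return "simple font encoding";
        }
        boolean identity = "Identity-H".equals(f.encoding)
                || "Identity-V".equals(f.encoding);
        if (!identity) {
            // Assumption: a non-Identity encoding name is a predefined CMap.
            return "predefined CMap";
        }
        if (KNOWN_COLLECTIONS.contains(f.characterCollection)) {
            return "Adobe character collection";
        }
        return "none";
    }
}
```

For the font in question (no ToUnicode entry, composite, Identity-H, PDFXC30-Identity), mappingSource falls through every branch and returns "none", which matches the warning PDFBox logs.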

Thus, text extraction as implemented in PDFBox cannot extract that Chinese text.

An alternative source of text information during text extraction presented in the PDF specification is the ActualText entry of structure elements or marked-content sequences. But your PDF does not have any such ActualText entries either.

Thus, Adobe Acrobat copy and paste (which uses a combination of the algorithm mentioned before and ActualText analysis) cannot extract that Chinese text either.

So "copy with formatting" in Adobe Acrobat Pro apparently must use some information beyond the mechanisms proposed by the PDF specification.

Inspecting the embedded font resource itself, one can see that it contains neither its own mappings to Unicode nor any standard names. It is notable, though, that the glyph numbers are not consecutive but have gaps. Probably these numbers have been retained from the full font during subsetting.

Adobe Acrobat Pro, therefore, appears to do one of the following during "copy with formatting" of your Chinese text:

- They know the details of the PDFXC30-Identity character collection, either officially from PDF-XChange or by reverse-engineering, and extract using that information.
- (If the assumption is correct that glyph numbers have been retained from the full font during subsetting:) They know the SimSun font and have a glyph-number-to-Unicode mapping to use for extraction.
- They take a full copy of the SimSun font (either provided internally or by the host OS), compare the glyphs therein with those in the embedded subset, and derive a mapping to Unicode from that for text extraction.
- They apply OCR to the individual glyphs of the embedded font and derive a mapping to Unicode from the results.
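
To illustrate what the first two options would amount to once such a mapping is available, here is a minimal sketch in plain Java. It is not PDFBox code and not a description of what Acrobat actually does. The class IdentityHDecoder and its methods are hypothetical names, and the mapping table is assumed to come from some external source. The sketch relies on the fact that with the Identity-H encoding each character code is two bytes, big-endian, and equals the CID. Treating those CIDs as the glyph numbers of the full font is an additional assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch: decode an Identity-H string with an externally supplied mapping. */
final class IdentityHDecoder {

    /** Splits the string bytes into 2-byte big-endian codes; each code is the CID. */
    static List<Integer> toCids(byte[] bytes) {
        List<Integer> cids = new ArrayList<>();
        // A trailing odd byte, if any, is ignored in this sketch.
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            cids.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return cids;
    }

    /**
     * Looks up each CID in the supplied table (hypothetical, e.g. derived from
     * the full font) and emits U+FFFD for CIDs the table does not cover.
     */
    static String decode(byte[] bytes, Map<Integer, String> cidToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : toCids(bytes)) {
            sb.append(cidToUnicode.getOrDefault(cid, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

As a usage example, the two bytes 0x47 0xFD form the code 18429, the CID from the warning in the question. If the table contained an entry for 18429, decode would return that entry's text; which character that CID really stands for is not something the PDF itself reveals.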

Googling for the PDFXC30-Identity character collection, one sees that numerous text extraction tools have issues with it. For example, on the Aspose forums one can read:

Our team has looked into this issue and I would like to share with you that the software you used to create the sample PDF files used PDFXC30 character collection. This character collection is not standard and we don’t have any information about this encoding. This makes correct text extraction impossible at the moment.

(shahzadlatif's most recent response in the "PdfExtractor encoding issue" thread)

If you can provide PDFXC30 character collection mapping files from a trustworthy source, PDFBox development may include them in PDFBox to enable text extraction for files like yours.
