
Here is what I have done, but the extracted text comes out garbled. Thanks in advance.

1. Use CGPDFStringCopyTextString to get the text from the PDF.

2. Convert the NSString to a char * using the GB 18030 encoding:

NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
const char *char_content = [self.currentData cStringUsingEncoding:enc];

Below is how I get the currentData:

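// Callback for the "TJ" operator: its operand is an array that mixes
// strings with numeric position adjustments.
void arrayCallback(CGPDFScannerRef inScanner, void *userInfo)
{
  BIDViewController *pp = (__bridge BIDViewController *)userInfo;
  CGPDFArrayRef array;
  if (!CGPDFScannerPopArray(inScanner, &array))
      return; // no array operand; `array` would be uninitialized
  size_t count = CGPDFArrayGetCount(array);
  for (size_t n = 0; n < count; n++)
  {
      CGPDFStringRef string;
      // Fails for the numeric elements, which are skipped here.
      if (CGPDFArrayGetString(array, n, &string))
      {
          // "Copy" returns a +1 CFStringRef; hand ownership to ARC to avoid a leak.
          NSString *data = CFBridgingRelease(CGPDFStringCopyTextString(string));
          if (data)
              [pp.currentData appendString:data];
      }
  }
}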
- (IBAction)press:(id)sender {
    table = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
    CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
    self.currentData = [NSMutableString string];
    CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pagerf);
    CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, (__bridge void *)(self));
    bool ret = CGPDFScannerScan(scanner);
    if (!ret)
        NSLog(@"Scanning the content stream failed");
    // Core Graphics objects are not managed by ARC; release what was created here.
    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(contentStream);
}
What do you mean by disorderly? And have you checked whether the output of CGPDFStringCopyTextString is indeed GB_18030_2000 encoded? - mkl
I mean that the characters become unrecognizable, because of a different encoding or for some other reason. And how do I find out which encoding CGPDFStringCopyTextString uses? Thanks. - alexqinbj
I want to encode the result of CGPDFStringCopyTextString with GB_18030_2000. It should show the Chinese text correctly, but it doesn't work. - alexqinbj
The problem may already occur before your GB_18030_2000 encoding is used, cf. my answer below. - mkl

1 Answer


According to the Mac Developer Library, CGPDFStringCopyTextString returns a CFString object that represents a PDF string as a text string. The PDF string is given as a CGPDFString, which is a series of bytes (unsigned integer values in the range 0 to 255); thus, this function already decodes the bytes according to some character encoding.

The function is not given an encoding explicitly, so it has to assume one. Most likely it assumes PDFDocEncoding or the UTF-16BE Unicode character encoding scheme, which are the two encodings that may be used to represent text strings in a PDF document outside the document’s content streams, cf. section 7.9.2.2 Text String Type and Table D.1, Annex D in the PDF specification.
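To make the distinction concrete, here is a minimal sketch in plain C of how a reader can tell the two text string encodings apart: per section 7.9.2.2, a UTF-16BE text string starts with the byte order mark 0xFE 0xFF, and anything else is PDFDocEncoding. The names pdf_text_encoding and pdf_text_string_encoding are made up for this illustration; this is not Apple's implementation, only a guess at the kind of check CGPDFStringCopyTextString has to make.

```c
#include <stddef.h>

/* The two encodings the PDF specification allows for text strings
   outside content streams (section 7.9.2.2).
   Hypothetical type name, for illustration only. */
typedef enum {
    PDF_TEXT_PDFDOCENCODING,
    PDF_TEXT_UTF16BE
} pdf_text_encoding;

/* A text string is UTF-16BE if it starts with the byte order mark
   0xFE 0xFF; otherwise it is treated as PDFDocEncoding. */
pdf_text_encoding pdf_text_string_encoding(const unsigned char *bytes, size_t len)
{
    if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return PDF_TEXT_UTF16BE;
    return PDF_TEXT_PDFDOCENCODING;
}
```

A string taken from a content stream usually carries no such marker, so a check like this would classify it as PDFDocEncoding regardless of what the bytes really mean.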

You have not told us where your CGPDFString comes from. I assume, though, that it comes from inside one of the document’s content streams. Strings there, in contrast, can be encoded with any imaginable encoding. The encoding used is given by the embedded data of the font the string is to be displayed with.
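The following sketch in plain C illustrates why the font matters: the same bytes decode to different text depending on which font's mapping is applied. Everything here is an assumption made for illustration. The types code_mapping and font_map, the function decode_with_font, and the fixed one- or two-byte code width are hypothetical; real PDF fonts describe their mapping through encoding entries, CMaps, and ToUnicode data, which are considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical per-font map from character code to Unicode. */
typedef struct {
    uint16_t code;     /* character code as it appears in the content stream */
    uint32_t unicode;  /* Unicode code point it stands for */
} code_mapping;

typedef struct {
    const code_mapping *entries;
    size_t count;
    size_t bytes_per_code;  /* 1 or 2 in this sketch */
} font_map;

/* Decode `len` bytes with the map of the font that was current when the
   string was shown. Returns the number of code points written to `out`
   (at most `out_cap`); codes the font does not map become U+FFFD. */
size_t decode_with_font(const font_map *font, const unsigned char *bytes,
                        size_t len, uint32_t *out, size_t out_cap)
{
    size_t width = font->bytes_per_code;
    size_t written = 0;
    if (width != 1 && width != 2)
        return 0;  /* unsupported code width in this sketch */
    for (size_t i = 0; i + width <= len && written < out_cap; i += width) {
        uint16_t code = bytes[i];
        if (width == 2)
            code = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
        uint32_t cp = 0xFFFD;  /* replacement character for unmapped codes */
        for (size_t k = 0; k < font->count; k++) {
            if (font->entries[k].code == code) {
                cp = font->entries[k].unicode;
                break;
            }
        }
        out[written++] = cp;
    }
    return written;
}
```

With one font that maps code 0x0001 to U+4E2D and another that maps it to U+0041, the bytes 00 01 yield a Chinese character in the first case and a Latin "A" in the second. This is why re-encoding the result of CGPDFStringCopyTextString as GB 18030 cannot repair the text: the information about which font was active has to be used while decoding the raw bytes.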

For more information on this, you may want to read the question "CGPDFScannerPopString returning strange result" and have a look at PDFKitten.