Reproducing the issue
As initially the OP couldn't provide a sample file, I first tried to reproduce the issue with a file generated by iTextSharp itself.
My test method first creates a PDF using the ColumnText.ShowTextAligned
with the string constant which according to the OP returns a good result. Then it extracts the text content of that file. Finally it creates a second PDF containing a line created using the good ColumnText.ShowTextAligned
call with the string constant and then several lines created using ColumnText.ShowTextAligned
with the extracted string with or without the post-processing instructions from the OP's code (UTF8-encoding and -decoding; applying UnicodeCharacterPlacement
) performed.
I could not immediately find the UnicodeCharacterPlacement
class the OP uses. So I googled a bit and found one such class here. I hope this is essentially the class used by the OP.
public void ExtractTextLikeUser2509093()
{
string rtlGood = @"C:\Temp\test-results\extract\rtlGood.pdf";
string rtlGoodExtract = @"C:\Temp\test-results\extract\rtlGood.txt";
string rtlFinal = @"C:\Temp\test-results\extract\rtlFinal.pdf";
Directory.CreateDirectory(@"C:\Temp\test-results\extract\");
FontFactory.Register("c:\\windows\\fonts\\tahoma.ttf");
Font tahoma = FontFactory.GetFont("tahoma", BaseFont.IDENTITY_H);
// A - Create a PDF with a good RTL representation
using (FileStream fs = new FileStream(rtlGood, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 300,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
// B - Extract the text for that good representation and add it to a new PDF
String textA, textB, textC, textD;
using (PdfReader pdfReader = new PdfReader(rtlGood))
{
textA = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new LocationTextExtractionStrategy());
textB = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(textA.ToString()));
textC = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textA);
textD = new UnicodeCharacterPlacement
{
Font = new System.Drawing.Font("Tahoma", 12)
}.Apply(textB);
File.WriteAllText(rtlGoodExtract, textA + "\n\n" + textB + "\n\n" + textC + "\n\n" + textD + "\n\n");
}
using (FileStream fs = new FileStream(rtlFinal, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document document = new Document())
{
PdfWriter pdfWriter = PdfWriter.GetInstance(document, fs);
document.Open();
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk("Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم", tahoma)),
x: 500,
y: 600,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textA, tahoma)),
x: 500,
y: 550,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textB, tahoma)),
x: 500,
y: 500,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textC, tahoma)),
x: 500,
y: 450,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
ColumnText.ShowTextAligned(
canvas: pdfWriter.DirectContent,
alignment: Element.ALIGN_RIGHT,
phrase: new Phrase(new Chunk(textD, tahoma)),
x: 500,
y: 400,
rotation: 0,
runDirection: PdfWriter.RUN_DIRECTION_RTL,
arabicOptions: 0);
}
}
}
The final result:
Thus,
I cannot reproduce the issue. Both the final two variants to me look identical in their Arabic contents with the original line. In particular I could not observe the switch from "سلام" to "سالم". Most likely content of the PDF C:\Users\USER\Desktop\Test.pdf
(from which the OP extracted the text in his test) is somehow peculiar and so text extracted from it draws with that switch.
Applying that UnicodeCharacterPlacement
class to the extracted text is necessary to get it into the right order.
The other post-processing line,
text = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text.ToString()));
does not make any difference and should not be used.
For further analysis we would need that PDF C:\Users\USER\Desktop\Test.pdf
.
Inspecting salamword.pdf
Eventually the OP could provide a sample PDF, salamword.pdf:
I used "PrimoPDF" to create a PDF file with this content: "Test. Hello world. Hello people. سلام. کلمه سلام. سلام مردم".
Next I read this PDF file. Then I received this output: "Test. Hello world. Hello people. م . م . م دم ".
Indeed I could reproduce this behavior. So I analyzed the way the Arabic writing was encoded inside...
Some background information to start with:
Fonts in PDFs can have (and in the case at hand do have) a completely custom encoding. In particular embedded subsets often are generated by choosing codes as the characters come, e.g. the first character from a given font used on a page is encoded as 1
, the second different as 2
, the third different as 3
etc.
Thus, simply extracting the codes of the drawn text does not help very much at all (see below for an example from the file at hand). But a font inside a PDF can bring along some extra information allowing an extractor to map the codes to Unicode values. These information might be
- a ToUnicode map providing an immediate map code -> Unicode code point;
- an Encoding providing a base encoding (e.g. WinAnsiEncoding) and differences from it in the form of glyph names; these names may be standard names or names only meaningful in the context of the font at hand;
- ActualText entries for a structure element or marked-content sequence.
The PDF specification describes a method using the ToUnicode and the Encoding information with standard names to extract text from a PDF and presents ActualText as an alternative way where applicable. The iTextSharp text extraction code implements the ToUnicode / Encoding method with standard names.
Standard names in this context in the PDF specification are character names taken from the Adobe standard Latin character set and the set of named characters
in the Symbol font.
In the file at hand:
Let's look at the Arabic text in the line written in Arial. The codes used for the glyphs here are:
01 02 03 04 05 01 02 06 07 01 08 02 06 07 01 09 05 0A 0B 01 08 02 06 07
This looks very much like an ad-hoc encoding as described above is used. Thus, using only these information does not help at all.
Thus, let's look at the ToUnicode mapping of the embedded Arial subset:
<01><01><0020>
<02><02><0645>
<03><03><062f>
<04><04><0631>
<08><08><002e>
<0c><0c><0028>
<0d><0d><0077>
<0e><0e><0069>
<0f><0f><0074>
<10><10><0068>
<11><11><0041>
<12><12><0072>
<13><13><0061>
<14><14><006c>
<15><15><0066>
<16><16><006f>
<17><17><006e>
<18><18><0029>
This maps 01
to 0020
, 02
to 0645
, 03
to 062f
, 04
to 0631
, 08
to 002e
, etc. It does not map 05
, 06
, 07
, etc to anything, though.
So the ToUnicode map only helps for some codes.
Now let's look at the associated encoding
29 0 obj
<</Type/Encoding
/BaseEncoding/WinAnsiEncoding
/Differences[ 1
/space/uni0645/uni062F/uni0631
/uni0645.init/uni06440627.fina/uni0633.init/period
/uni0647.fina/uni0644.medi/uni06A9.init/parenleft
/w/i/t/h
/A/r/a/l
/f/o/n/parenright ]
>>
endobj
The encoding is based on WinAnsiEncoding but all codes of interest are remapped in the Differences. There we find numerous standard glyph names (i.e. character names taken from the Adobe standard Latin character set and the set of named characters
in the Symbol font) like space, period, w, i, t, etc.; but we also find several non-standard names like uni0645, uni06440627.fina etc.
There appears to be a scheme used for these names, uni0645 represents the character at Unicode code point 0645, and uni06440627.fina most likely represents the characters at Unicode code point 0644 and 0627 in some order in some final form. But still these names are non-standard for the purpose of text extraction according to the method presented by the PDF specification.
Furthermore, there are no ActualText entries in the file at all.
So the reason why only " م . م . م دم " is extracted is that only for these glyphs there are proper information for the standard PDF text extraction method in the PDF.
By the way, if you copy&paste from your file in Adobe Reader you'll get a similar result, and Adobe Reader has a fairly good implementation of the standard text extraction method.
TL;DR
The sample file simply does not contains the information required for text extraction with the method described by the PDF specification which is the method implemented by iTextSharp.
PdfReader reader = new PdfReader(@"C:\Users\USER\Desktop\salam.pdf");PdfStamper stp = new PdfStamper(reader, new FileStream(@"C:\Users\USER\Desktop\File1.pdf", FileMode.Create)); stp.Close();reader.Close();
– user2509093