
By opening many executable files (.exe, .msi) in Windows using 7zip, I have noticed many different file types that are common. Those include .text, .data, .bss, .rdata, .pdata etc. I've tried to get information about them, but I can't find out what they all mean. Here are some of them:

  • .text : Code section, contains the program's instructions; normally mapped read-only and executable.
  • .data : Generally used for writable data with some initialized non-zero content. Thus, the data section contains information that could be changed during application execution and this section must be copied for every instance.
  • .bss : Used for writable static data initialized to zero.
  • .rdata : Const / Read-only data of any kind are stored here.
  • .edata : Export directory; lists the functions and data that the image exports to other modules (typical of DLLs).
  • .idata : Import directory; lists the functions and data that the image imports from other modules. The loader uses it to resolve imports when the executable file (exe, dll etc.) is loaded.
  • .rsrc : Section which holds information about various other resources needed by the executable, such as the icon that is shown when looking at the executable file in explorer

There are many others which are common and which I can't find any information on. Mostly those are: .pdata, .tls, .reloc, CERTIFICATE, .rsrc_1, .aspack, .adata, .INIT, DATA, CODE, .ctors.

Also a rsrc folder is contained in most of them, which contains folders like BITMAP, CURSOR, ICON, GROUP_CURSOR, GROUP_ICON, MENU, VERSION and others.

Some executables also contain more executables inside, .html files, .txt files etc. I also opened one which contained nothing at all (at least nothing shown by opening it with 7zip)! [ I opened them all with 7zip. ]


Questions

  1. What do those sections / segments I posted do? Is there a website where I can find them all?
  2. All those I looked at are PEs for Windows. Are these formats standard, and do they apply to Linux, Unix etc. in a similar / the same way?
  3. Why do some executables contain other executables inside, or .html, .txt and other files? How are these handled when you launch the executable? What are they supposed to do? AFAIK everything inside an executable should have only those "segments" that resemble assembly code sections.
  4. What is the use of rsrc folder? What kind of resources does it hold?

I would appreciate it if you could post more information / links as to why all these are used (as low level as possible) and generally what the executable structure should look like, what it should contain etc.

That's about all.


EDIT

I found other common section header names. I will post their meaning here for completeness.

  • .reloc : Contains the relocation table.
  • .pdata : Contains an array of function table entries for exception handling, and is pointed to by the exception table entry in the image data directory.
  • *data : custom data section names
  • .init : This section holds executable instructions that contribute to the process initialization code. That is, when a program starts to run the system arranges to execute the code in this section before the main program entry point (called main in C programs).
  • .fini : This section holds executable instructions that contribute to the process termination code. That is, when a program exits normally, the system arranges to execute the code in this section.
  • .ctors : Section which preserves a list of constructors
  • .dtors : Section that holds a list of destructors

2 Answers

6
votes

Section names are not relevant to the file format; the toolchain (typically the linker) can pick anything it likes. The operating system does not use names to locate the sections it cares about; it uses the data directory in the file header, which contains numbers (addresses and sizes), not names. The name just serves as a mnemonic to help identify sections. It might also help a language runtime or debugger locate sections that are not covered by the data directory.
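To make the "names are only labels" point concrete, here is a minimal sketch in Python that walks the section table of a PE image. The field layout follows the published PE/COFF format; the helper names (`read_sections`, `demo_image`) and the synthetic two-section header blob are assumptions made for illustration, and the blob is not a runnable program.

```python
import struct

SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")  # 40 bytes per section header
COFF_HEADER = struct.Struct("<HHIIIHH")         # 20 bytes, follows "PE\0\0"

def read_sections(image: bytes):
    """Return (name, virtual_address, virtual_size, raw_size) per section."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew field
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    fields = COFF_HEADER.unpack_from(image, pe_offset + 4)
    section_count, optional_header_size = fields[1], fields[5]
    # The section table starts right after the optional header.
    table = pe_offset + 4 + COFF_HEADER.size + optional_header_size
    sections = []
    for i in range(section_count):
        raw_name, vsize, vaddr, raw_size, *_ = SECTION_HEADER.unpack_from(
            image, table + i * SECTION_HEADER.size)
        # The name is just 8 bytes of padded text; nothing interprets it here.
        name = raw_name.rstrip(b"\0").decode("ascii", "replace")
        sections.append((name, vaddr, vsize, raw_size))
    return sections

def demo_image() -> bytes:
    """Build a tiny synthetic header blob (hypothetical, not a real program)."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)             # PE header follows directly
    coff = COFF_HEADER.pack(0x14C, 2, 0, 0, 0, 0, 0)  # 2 sections, no optional header
    text = SECTION_HEADER.pack(b".text", 0x200, 0x1000, 0x200, 0, 0, 0, 0, 0, 0)
    odd = SECTION_HEADER.pack(b"MYSTUFF", 0x100, 0x2000, 0, 0, 0, 0, 0, 0, 0)
    return bytes(dos) + b"PE\0\0" + coff + text + odd

for section in read_sections(demo_image()):
    print(section)
```

On a real file you would pass `open(path, "rb").read()` instead of `demo_image()`. The second section shows a virtual size with no raw data, which is how zero-initialized data typically appears.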

There is some consistency in section names, largely by convention. A weirdo section name like BSS goes all the way back to the 1950s: it was an assembler directive on IBM machines of the early Fortran era, an acronym for Block Started by Symbol. That does not help much in guessing its use today :) You can assume that a section named CODE will contain executable code and is equivalent to .text, the much more common name choice. Names like .tls and .reloc can be mapped to the corresponding data directory entry without much trouble.

Same recipe for .rsrc: it maps to the third entry in the data directory. It matters to the OS; a winapi function like LoadString needs it.
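A rough sketch of that mapping in Python: the lookup goes from a data directory index to an address, and only then to whichever section happens to contain that address. The index meanings follow the PE format; the section list and directory values are hypothetical, and the resource data is deliberately placed in a section called STUFF to show that the name plays no part.

```python
# Index -> meaning for a few of the PE data directory entries.
DIRECTORY_NAMES = {0: "export", 1: "import", 2: "resource", 3: "exception",
                   5: "base relocation", 9: "TLS"}

def section_containing(rva, sections):
    """Return the name of the section whose virtual range holds rva."""
    for name, start, size in sections:
        if start <= rva < start + size:
            return name
    return None

# Hypothetical values, as if read from some image's headers.
sections = [(".text", 0x1000, 0x3000), ("STUFF", 0x4000, 0x1000), (".reloc", 0x5000, 0x200)]
data_directory = {2: (0x4000, 0x800), 5: (0x5000, 0x100)}  # index -> (rva, size)

for index, (rva, size) in data_directory.items():
    print(DIRECTORY_NAMES[index], "->", section_containing(rva, sections))
```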

However, only knowing the toolchain in detail gives you a real clue to the oddball ones.

The operating system loader places a section directly into virtual memory through a memory-mapped file that uses the executable file as the backing store. That is how sections like .text, .data and .bss are used; note that they don't have a corresponding entry in the data directory. The linker took care of generating the proper addresses, the way it was done 25+ years ago, with no help needed from the OS. The exception is the .reloc section, which the loader needs if the file could not be mapped at its preferred base address; that mechanism is old as well.
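The relocation case can be sketched as follows. This is a simplified illustration in Python, assuming 32-bit absolute addresses and a flat list of fixup offsets; a real .reloc section stores its entries in per-page blocks with type codes, which this sketch leaves out. The function name `rebase` and the sample addresses are hypothetical.

```python
import struct

def rebase(image: bytearray, fixup_offsets, preferred_base: int, actual_base: int) -> None:
    """Add the load-address difference to every absolute address in the image."""
    delta = actual_base - preferred_base
    for offset in fixup_offsets:
        (address,) = struct.unpack_from("<I", image, offset)
        struct.pack_into("<I", image, offset, (address + delta) & 0xFFFFFFFF)

# The linker assumed the image would load at 0x00400000 and stored an
# absolute pointer to 0x00401234 at offset 4 (hypothetical values).
image = bytearray(16)
struct.pack_into("<I", image, 4, 0x00401234)

# The loader could only map the image at 0x10000000, so it patches the pointer.
rebase(image, [4], preferred_base=0x00400000, actual_base=0x10000000)
patched = struct.unpack_from("<I", image, 4)[0]
print(hex(patched))
```

If the image loads at its preferred base, the delta is zero and nothing needs patching, which is why the other sections need no help from the OS.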

2
votes

There is no magic here, nor is there a universal rule on the construction of an executable. You do have to follow the rules of the operating system in question and feed it only executable formats that it understands. But even with that, those sections, despite their very common use across compilers over decades, are technically arbitrary. You basically did all the work you needed to do and answered your own question.

The operating system only needs to know a couple of things: how much of this file is actually loadable data, and where to load it. The operating system doesn't distinguish .text from .data; it sees loadable blocks. It copies those blocks from the file into memory, then it sees that an entry point is defined and branches to that entry point. The rest of the information is...information...for the debugger, be it software or a human interested in seeing how much or what the compiler placed in the .data section, for example.
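As a rough illustration of "loadable blocks plus an entry point", here is a toy loader in Python. The file layout (an entry address, a block count, then offset/size/destination triples) is invented for this sketch and is not any real executable format; `load_toy_image` is a hypothetical name.

```python
import struct

def load_toy_image(file_bytes: bytes, memory: bytearray) -> int:
    """Copy each loadable block to its destination and return the entry point."""
    entry, count = struct.unpack_from("<II", file_bytes, 0)
    for i in range(count):
        offset, size, dest = struct.unpack_from("<III", file_bytes, 8 + 12 * i)
        blob = file_bytes[offset:offset + size]
        if len(blob) != size or dest + size > len(memory):
            raise ValueError("block does not fit")
        memory[dest:dest + size] = blob  # the loader never looks at a section name
    return entry

# Build a toy file: a 32-byte header, then two blocks of raw bytes.
code = b"\x90\x90\xc3"
data = b"hi\x00"
header = struct.pack("<II", 0x10, 2)           # entry point 0x10, two blocks
header += struct.pack("<III", 32, 3, 0x10)     # code: file offset 32 -> address 0x10
header += struct.pack("<III", 35, 3, 0x40)     # data: file offset 35 -> address 0x40
toy_file = header + code + data

memory = bytearray(0x80)
entry_point = load_toy_image(toy_file, memory)
print(hex(entry_point))
```

A real loader would then transfer control to the entry address; the sketch only returns it.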

It is directly or indirectly up to the programmer to use those sections properly, although normally the programmer doesn't get directly involved. Software called a bootstrap performs the job of dealing with those sections as needed. For example, the bootstrap normally zeros the .bss section: a system design solution in the compiler toolchain tells the bootstrap the size and starting address of .bss, and the bootstrap zeros that RAM.

.data and .text are normally just loaded by an operating system and don't need further attention, as they are loaded into RAM. But if this were a microcontroller, for example, we would need to have our non-zero global .data in non-volatile storage (flash/ROM), yet when our compiled code is up and running we need it in RAM. So the bootstrap normally does the job of copying .data from flash to RAM, using a compiler system design solution that tells the bootstrap the starting address in flash, the starting address in RAM and how much to copy.

The system design I am talking about is a variable, if you will, that the bootstrap uses. The bootstrap is written in assembly language (otherwise it is a chicken-and-egg problem), and the variable is filled in by the linker after the linker has done its job and figured out how much there is of everything and where it is placed in the binary or memory image.
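The bootstrap's two jobs described above can be simulated in a few lines of Python. This is only a model: a real bootstrap is a handful of assembly instructions using addresses supplied by the linker, and the memory map, parameter names and values below are hypothetical.

```python
def bootstrap(memory, data_load_addr, data_start, data_size, bss_start, bss_size):
    """Copy initialized data from flash to RAM, then zero the .bss region."""
    memory[data_start:data_start + data_size] = \
        memory[data_load_addr:data_load_addr + data_size]
    memory[bss_start:bss_start + bss_size] = bytes(bss_size)

# Model one address space: flash at 0x000-0x0FF, RAM at 0x100-0x1FF.
FLASH_SIZE, RAM_BASE = 0x100, 0x100
memory = bytearray(b"\xFF" * FLASH_SIZE + b"\xAA" * 0x100)  # RAM starts as garbage

# The image stores the initial values of .data in flash at 0x80.
initial_data = b"\x01\x02\x03\x04"
memory[0x80:0x84] = initial_data

# These arguments stand in for the values the linker would fill in.
bootstrap(memory, data_load_addr=0x80, data_start=RAM_BASE, data_size=4,
          bss_start=RAM_BASE + 4, bss_size=8)
print(memory[RAM_BASE:RAM_BASE + 12].hex())
```

After the call, .data in RAM holds the values copied from flash, .bss is all zeros, and the RAM beyond .bss is left untouched.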

Data is data: you can embed data in your binary, be it text, HTML, images (jpg, bmp, png, etc.) or anything else. The right hexdump tool can show that, and the toolchain used might even have special section names for that data.

Pretty much all of those section names are in part for debugging the compiler output and in part informational. A specific toolchain has the specific section names it uses, and perhaps even allows the user (programmer) to create their own arbitrary names, as that is all they are. That toolchain uses the information as part of its system design: the compiler sorts out program from data and, although it doesn't have to, historically sorts out non-zero global data from global data assumed to be zero on start, and perhaps, deeper than that, read-only non-zero data. It marks those object blobs with names so that the linker can collect all the named blobs and, using the linker script, do its job of assembling these blobs into bigger blobs and assigning them addresses, and then patching up the binary as needed to resolve external addresses/variables.
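The "collect named blobs and assign addresses" step can be sketched with a toy linker in Python. The object contents, the `.mydata` section name and the layout list (standing in for a linker script) are all made up for illustration, and the sketch skips symbol resolution and patching.

```python
def link(objects, layout):
    """Merge same-named blobs from every object and assign each an address."""
    merged, placements = {}, {}
    for section, base in layout:
        blob = bytearray()
        for obj_name, obj_sections in objects:
            part = obj_sections.get(section, b"")
            if part:
                # Remember where this object's contribution ended up.
                placements[(obj_name, section)] = base + len(blob)
                blob += part
        merged[section] = (base, bytes(blob))
    return merged, placements

# Two hypothetical object files; ".mydata" is a name a programmer invented.
objects = [
    ("a.o", {".text": b"\x01\x02", ".data": b"\x10"}),
    ("b.o", {".text": b"\x03", ".mydata": b"\x20\x21"}),
]
# The layout plays the role of the linker script: section order and base addresses.
layout = [(".text", 0x1000), (".data", 0x2000), (".mydata", 0x3000)]

merged, placements = link(objects, layout)
for name, (base, blob) in merged.items():
    print(name, hex(base), blob.hex())
```

The invented section is handled exactly like the conventional ones, which is the sense in which the names are arbitrary.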

Question 1: No, there is no website where you can find them all. It is easy to demonstrate that at least one toolchain allows the user to invent their own section names, so it is not possible for one or many websites to cover every section name, and every definition of it, that some programmer may dream up.

Question 2: There is a general set, .text, .data and .bss, that has been adopted by most if not all toolchains on all target systems (Windows, Unix, etc.). This is a function of the toolchain, not the operating system, as the operating system doesn't know or care; it just loads the loadable blobs and branches to the entry point. Since these names are arbitrary and only have to work within the system design of a toolchain, it doesn't make sense to ask about operating systems.

Question 3: ALL sections, strange or not, are managed either indirectly by the toolchain or linked-in libraries, or directly by the programmer. From .bss to .somethingimadeup.

Question 4: That sounds operating system specific. Understand that the operating system defines what the supported executable formats are and what they consist of. The compiler has to conform to that in order to make working binaries. For example, an operating system like Windows might very much like to have an icon bitmap in the "binary" so that it can show it on the desktop next to the program's name, which is also information in the binary. So in addition to the obvious things a binary file format needs to have (offset into the file, size and destination address in memory for loadable blobs of data, and the execution entry point), the file format may carry other informational or miscellaneous items. The file format for a "Windows shortcut" might be a subset or special "binary" format whose information is a path and filename to another file rather than code you actually load and run. Similarly, you might have a "binary" file format that contains a URL. But this would all be very much operating system defined and dependent.