Author Topic: the ultimate cameca binary data reverse engineering attempt  (Read 2070 times)

sem-geologist

  • Professor
  • ****
  • Posts: 311
This thread is dedicated to consolidating and describing an attempt to reverse engineer the binary data files produced by Cameca PeakSight versions 5.1 (or 5.0) through 6.4.
It is aimed at interoperability and hopefully can provide some nice additions to PfE (i.e. reading of optical, single-channel video or multichannel/mapping impDat; other options like reading wdsDat or calDat would, I think, also be interesting to have).

The forum is used and read mostly by PfE users, but I see that not all forum users are PfE users (including myself, of course), and this RE attempt can be useful for building workflows independent of PeakSight. It can also be useful for recovering old datasets and for other customized needs with old data (for users who have PfE but still have lots of old data in PeakSight formats). So this RE is done, in essence, independently from PfE.

There are some previous threads which discussed or mentioned the lack of knowledge about the impDat structure.

There are no questions about qtiDat, calDat or wdsDat on the forum, so I am adding some code examples of usage for impDat first.
Albeit my personal reasons for RE lie mostly in wdsDat and impDat together with qtiDat, and the Python code directly implementing these RE descriptions will come first for wdsDat (I need that to prove some points in some other forum threads).

A bit of history and how it evolved into its current shape
This started as an attempt to reverse engineer simple impDat images (produced with the "Save Image" button in PeakSight), but after seeing how complicated it gets with impDat produced by the mapping workflow, the attempt stalled. Later, needing to read massive wdsDat files, the attempt was continued. It looked OK for wdsDat and impDat produced by PeakSight 6, but did not work for files produced by PeakSight 5. Up to that point the RE was done old-school style, that is, by taking notes of offsets while bisecting different binary files in hex editors (wxHexEditor, and partly Hexinator) and directly writing/modifying the parsing code in the target language (Python). It was hard to find logical and elegant differences between versions, and so the attempt stalled again. I should also mention that I had RE'd ovl files first of all (as far as I remember I was already using that in 2016) and had made an ovl file manager, as I find such functionality lacking in PeakSight, even in v6.4.
And this is where Kaitai Struct came to the rescue: https://kaitai.io/. Kaitai Struct is a language-agnostic way to describe binary formats (and actually an excellent tool for reverse engineering them). It also allows flexibility, as the parser can be regenerated for the language of choice in which custom workflows/methods are defined.

Kaitai Struct is a declarative (to emphasize: not imperative) way of parsing, meaning that bisecting between different kinds of files and versions is much easier (compared to traditional hex editors it has its pros there, but it has cons in other places). For this reason the attempt to RE only a particular type of file was replaced with a full-scale attempt to parse all .***Dat files, as that gives a larger sample of structures to look into and allows recognizing common structures shared between those different files. As the .***Dat file descriptions were nearly finalized, I realized that the .***Set files follow a similar philosophy, and while their direct usefulness (the necessity of RE'ing them) is questionable, successfully parsing them could cover the .***Dat files better. And so the parser was extended to partially parse .***Set files and .ovl (overlap correction) files. Bisecting all of these files makes it clear where the common header structure finishes and the different structure types start (bisecting only ***Dat files makes it hard to pinpoint where the header ends, as similar structures of ***Dat files occupy the sectors after the header; this is why including .***Set and .ovl files in the RE workflow is actually very important).

The binary structure descriptions for Kaitai Struct are saved as .ksy files. The descriptions inside a .ksy file are based on YAML (initially I thought that was an acronym for "Yet Another Mark[up/down] Language", but the creators say it is "YAML Ain't Markup Language", which I find kind of hilarious).
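To give a flavour of the declarative style, here is a minimal ksy sketch in the same spirit; the field names and the encoding are illustrative only and deliberately simplified, not the actual cameca.ksy:

Code: [Select]
meta:
  id: cameca_sketch    # illustrative, simplified; not the real cameca.ksy
  endian: le
seq:
  - id: header
    type: sxf_header
types:
  sxf_header:
    seq:
      - id: file_type        # hypothetical field names for demonstration
        type: u4
      - id: comment
        type: c_sharp_string
  c_sharp_string:            # a c#-style string: u4 length prefix, then text
    seq:
      - id: len
        type: u4
      - id: text
        type: str
        size: len
        encoding: CP1252     # the encoding here is a guess for the sketch

The whole format is described as data (fields, types, sizes, conditions), and the Kaitai compiler generates the imperative parsing code from it.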

The repository of this attempt is on github; that is where the binary description is updated, and opening issues or pull requests is welcome:
https://github.com/sem-geologist/peaksight-binary-parser

The simple way to check whether your binary files can be parsed with the current state of the format description is to use the Kaitai Struct Web IDE: https://ide.kaitai.io/. It works in the most popular web browsers (Chromium, Firefox, Brave...). After downloading the ksy file from the github repository, it can be drag-and-dropped into the IDE, followed by the binary file which you wish to inspect. They appear in the list at the left of the IDE. Double-clicking on the ksy and then on the binary file selects the parser and the file to be parsed, and that results in highlighting of the structure in the hex viewer and generation of the object tree view (which is nicely interconnected: clicking on a highlighted part in the hex viewer brings focus to the corresponding node in the object tree, and it works the other way around too).
It is possible to generate the parsing code for one of the target languages by right-clicking the ksy file in the list and choosing the target language from the menu. The generated code appears as one of the tabs above the hex view. Compilation to some languages (e.g. C++: header and implementation) generates a few files in separate tabs. The content can be copied to a new file in a plain text editor. To use the generated code, the Kaitai runtime library for the given target language needs to be downloaded (for some target languages package managers can be used).

A demonstration of parsing in the Kaitai Struct Web IDE:
« Last Edit: June 02, 2021, 12:29:14 PM by sem-geologist »

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #1 on: May 27, 2021, 03:38:55 AM »
Continued...

Naming of objects in the structure

I would like to hear feedback on my naming scheme.
The binary format contains a tree structure, and while the relation and position of an object in that tree can be tracked in the binary structures, the exact meaning of an object or value is a bit subjective. Things which could be named as seen in the GUI I named the same (i.e. beam_current, hv, bias, gain, dead_time..., dataset), but some naming comes out of nowhere, or its meaning has no direct mapping to GUI elements (i.e. signal, dataset item); after giving it long thought it made sense to me, though.
The main parsed file object is divided into two substructures: sxf_header and sxf_main. sxf_header contains very general information about the file and its changes; sxf_main contains the data and metadata, machine settings and so on.
Let's say we have a simple impDat generated by clicking the "Save Image" button, named dummy_img.impDat, or a simple impDat made by saving the optical image.
Prior to the Python example below, the ksy was compiled into Python code as cameca.py; the kaitai struct runtime was installed with pip.

The Python example below reads the file, assigns some metadata to variables, takes the data bytestring and stuffs it into a numpy array with a conditionally created data type; the result is then plotted using matplotlib (numpy and matplotlib are commonly available with most Python bundles, unless it is a very basic Python installation):

Code: [Select]
from cameca import Cameca  # the ksy compiled into cameca.py, on the execution path
import numpy as np
import matplotlib.pyplot as plt

single_img = Cameca.from_file('path_to_data/V30_2.impDat')

# dataset = single_img.sxf_main.datasets[0]  # previous naming scheme
dataset = single_img.content.datasets[0]  # latest
stage_x = dataset.dataset_header.stage_x
stage_y = dataset.dataset_header.stage_y
stage_z = dataset.dataset_header.stage_z[0]  # this is an array of z values
# img_item = dataset.dataset_items[0]  # previous naming scheme
img_item = dataset.items[0]  # latest

x_y_z = 'x:{} y:{} z:{}'.format(stage_x, stage_y, stage_z)
if img_item.signal_type == single_img.SignalSource.video:
    beam_current = img_item.signal_header.current_set
    beam_hv = img_item.signal_header.hv_set
    signal_name = img_item.signal_header.video_signal_type.name
elif img_item.signal_type == single_img.SignalSource.im_camera:
    signal_name = 'optical microscope'  # there is no string embedded

width = img_item.signal.width
height = img_item.signal.height
dx = img_item.signal.step_x
dy = img_item.signal.step_y
width_um = width * dx
height_um = height * dy
# build a numpy dtype describing one image row, depending on pixel type:
_d_type = img_item.signal.img_pixel_dtype.name
if _d_type == 'uint8':
    dtype = np.dtype(('u1', (width,)))
elif _d_type == 'rgbx':
    dtype = np.dtype((('u1', (4,)), (width,)))
data_bytearray = img_item.signal.data[0]
array_for_img = np.frombuffer(data_bytearray, dtype=dtype)

if len(array_for_img.shape) == 3:
    array_for_img = array_for_img[:, :, :3]  # skip the x in rgbx

plt.imshow(array_for_img, extent=(0, width_um, 0, height_um))
plt.xlabel('µm')
plt.title(signal_name)
# let's add the stage coordinates at the top centre of the image
plt.text(width_um / 2, height_um, x_y_z, horizontalalignment='center',
         verticalalignment='top', color='r')
plt.show()


When not making intermediate references half-way along the path, accessing the data directly through the whole tree gives probably the longest path:
Code: [Select]
# single_img.sxf_main.datasets[0].dataset_items[0].signal.data[0]  # previous
single_img.content.datasets[0].items[0].signal.data[0]  # latest naming scheme

For mosaics or mappings the code needs to be built up a bit more (a sketch follows after this list):
1. it would need to check for 'float32' as dtype and generate the corresponding numpy dtype;
2. in the above code we did not iterate through .datasets, then .items, and finally .data; we just accessed the first element of each using the `[0]` index. That is OK for these simple impDat files, but for mappings and mosaics these need to be iterated.

datasets is self-explanatory; dataset items hold the different kinds of element and video signals used for mapping (EDS, WDS, Video1, Video2, if enabled); data in a simple impDat contains only a single bytestring, but in more complex impDat it can contain a number of bytestrings, used conditionally:
  if it is a mosaic, this is where the mosaic tiles are stored. Every tile will have only a single bytestring; if the frame number is >1, the bytestring will be the sum of those frames (or the average?).
  if it is a mapping, then the first element (python index 0, so data[0]) will contain the average/sum, and the next indexes will contain the sub-samplings/frames. In the case of 1 frame there will be 2 identical binary strings; in the case of frames >1, the first will be the average and the others will be the single frames.
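A rough sketch of such a loop, reusing the assumptions of the simple example above (the 'float32' branch and its little-endian 'f4' layout are my reading of the format, so treat this as a starting point rather than a reference implementation):

Code: [Select]
from cameca import Cameca
import numpy as np

mapping = Cameca.from_file('path_to_data/some_mapping.impDat')  # hypothetical file name

for dataset in mapping.content.datasets:
    for item in dataset.items:
        signal = item.signal
        width = signal.width
        pixel_type = signal.img_pixel_dtype.name
        if pixel_type == 'uint8':
            dtype = np.dtype(('u1', (width,)))
        elif pixel_type == 'rgbx':
            dtype = np.dtype((('u1', (4,)), (width,)))
        elif pixel_type == 'float32':  # mapping data comes as 32-bit floats
            dtype = np.dtype(('f4', (width,)))
        # data[0] holds the average/sum; data[1:] hold the single frames
        # (for a mosaic, every tile is a separate single-bytestring entry)
        frames = [np.frombuffer(chunk, dtype=dtype) for chunk in signal.data]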

Generally, the beginnings of files and structures are much better covered than the endings of the big structures and files, and the number and size of reserved parts are larger there. Not only is the meaning of those bytes unclear; in some cases their place in the tree is not 100% accurate. As long as the parser can cut through multi-dataset files, that does not actually matter (whether a reserved field sits under dataset, or dataset_extra, or under some additional child of dataset or dataset_extra).
« Last Edit: June 01, 2021, 12:26:10 AM by sem-geologist »

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #2 on: May 28, 2021, 01:38:23 AM »
...continued

Debugging hints

The Kaitai Web IDE has some kind of debugger which I could not sort out (I have limited knowledge of JavaScript, which it uses for debugging). So what to do when things go south? One of the most common cases is that some unexpected (or rather previously unrecognized) structure gets activated and inserted into the bytestring at an unexpected address. E.g. we have defined a reserved field of 24 bytes and expect some string after it (let's say a comment) which is a c# string: a string prepended with an unsigned 32-bit integer telling how long the string is. If the length of that reserved field somehow changes (some unexpected structure increasing the size), then the bytes which should hold that string length will be read too early, and if they hit some large number, that will be interpreted as the length of the string. This can fail in a few ways: 1) running out of total file size while trying to read bytes as a very long string; 2) successfully reading the given number of bytes as a string, after which all following objects are read at wrongly offset addresses and an error results somewhere else.
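A tiny Python illustration of that failure mode (the bytes here are hypothetical, not a real PeakSight fragment): reading the u4 length prefix just four bytes too early turns reserved junk into an absurd string length:

Code: [Select]
import struct

# 24 "reserved" bytes of junk, then a c#-style string: u4 length + payload
buf = bytes(range(24)) + struct.pack('<I', 7) + b'comment'

good_len, = struct.unpack_from('<I', buf, 24)  # 7, reads b'comment' fine
bad_len, = struct.unpack_from('<I', buf, 20)   # 0x17161514 = 387323156
# a parser trusting bad_len would try to read a ~387 MB "string" and
# run off the end of the file, or derail everything that follows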

This is quite a con of the Kaitai Web IDE: if it hits an error, it won't show the structure up to the failing attribute, it won't show highlights in the hex view and it will not update the object tree view; instead it throws a cryptic error like "tried to parse 2345678768 bytes but only 8 bytes were up to end of the file". It is like "go figure out yourself where the problem is"; there is no hint of the address where parsing derailed from our scheme.
When a situation like that arises, I have a working systematic approach (precise commenting-out of lines with a hashtag; see the ksy sketch after these steps):
  1) comment out the parsing of "sxf_main" under the main "seq"; that allows checking whether the file header is parsed correctly.
If it is OK, uncomment the parsing of sxf_main.
  2) comment out everything before datasets under "sxf_main"; if that does not help, then comment out the 'repeat-expr: *****' and 'repeat: expr' lines, or comment out the expression and put 1 before it (like this: 'repeat-expr: 1 # *****'). The point is to restrict parsing to a single dataset, so we can track in the hex view and object tree where things derailed. Sometimes the offending dataset is not the first but some middle dataset; by replacing '1' with another number (e.g. that of a middle dataset) we can narrow down the dataset where things derail. Depending on whether it is the 1st or some other dataset, the debugging approaches differ a bit:
  3A) in case parsing completely fails while parsing the 1st dataset: comment out the child structures increasingly from the end, and at points with iterative expressions restrict those to 1.
  3B) in case parsing completely fails while parsing the nth (n != 1) dataset, a bit more creativity is required: we can't comment out parts of the child structures of datasets, as then the 2nd dataset would start at a wrong offset. What we do in that case is set "repeat-expr: n" to the number of datasets which still parse OK, then make a copy of the child dataset structures (like dataset_item), rename them by adding e.g. _debug, and inject the parsing of that *_debug type directly after the main dataset parsing loop. That *_debug type can be dealt with as in 3A, and so the part where things go wrong can easily be found.
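For step 2, the edited part of the ksy could look something like this (names simplified for illustration; the real expression is whatever stands in cameca.ksy):

Code: [Select]
# inside the sxf_main type, restricting parsing to a single dataset:
seq:
  - id: datasets
    type: dataset
    repeat: expr
    # repeat-expr: dataset_count   # original expression, parked in a comment
    repeat-expr: 1                 # parse only the 1st dataset for now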

This allows finding any non-standard structures added to the file in just a few minutes.
« Last Edit: May 28, 2021, 01:42:44 AM by sem-geologist »

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #3 on: May 30, 2021, 02:00:43 PM »
Some minor (API breaking) changes

No one is giving any feedback, so I started to look at how to use this directly, in a serious way, in Python. First of all, at last I could scrap my previous direct parsing attempt in Python; I am really happy with the ksy-generated classes, as they are actually a much more elegant solution. Initially I was not sure about the outcome and thought that ksy would impose lots of code limitations, so I previously planned to use the ksy-generated code only for file parsing, copy the relevant parsed data and metadata into attributes of my custom Python objects, and then dump the parser object for garbage collection. But I started to realize that the generated classes are not bad at all, and that directly subclassing the parser goes much more fluently. E.g. Kaitai puts some really nice stuff in place, like the `_parent` attribute. After adding `_children`, the hierarchical tree is ready in place. With that in mind, it is much more important for me to get concise and logical attribute names right from the beginning. I hate typing long stuff, and it is good to keep attribute names short while keeping them informative enough.

So I think `sxf_header` and `sxf_main` could be renamed to simply `header` and `content` (dropping sxf and leaving "main" is not very informative, while "header" and "content" look informative enough by themselves). Another rename I would do is the "dataset_items" container: "items" would be sufficient, as it is hierarchically clear what kind of items those are (they sit under dataset). I gave a lot of thought to type names when doing the RE, but the usefulness of attribute names, or their "getting in the way by being too verbose", can only be judged once one actually starts trying to use them. So these changes will affect only attribute names; type names will be kept the same. `_root.header` will still have the type `sxf_header`, as there can be all kinds of `headers`, and they are defined in the ksy at the same level under the main `types:` section.

One more ksy limitation is that it does not allow anything other than lowercase Latin alphanumerics for enum members (it is cast to ALL CAPS in some languages, and in some languages left as is (i.e. for Python, albeit enums should be ALL CAPS by Python code formatting conventions)). Previously I had not parsed elements or xray_lines, leaving that to the target language. I decided to change that and make such an enum. Again, this is due to the realization that subclassing the parser can be more beneficial, and so doing as much parsing into the right form at file-reading time is a cleaner solution than doing post-read mangling/reassignments/copies/typecasts of data. For this it is much better to define as much as possible in the ksy, even a simple type holding a single attribute with a single integer, as it is much easier to override predefined and used types (classes) or methods than to monkey-patch objects. Kaitai has two additional official mechanisms: opaque types and processes; maybe one of them would do the trick. But for xray lines a simple enum will do, and that enum can then easily be overridden in the target language, which is much easier than dealing with opaque types or processes, or monkey-patching... at least in Python, where lots of things are easier.
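As a taste of what I mean by overriding the enum in Python (a sketch only: the member names and values here are made up for illustration, not the actual mapping in cameca.ksy, and it assumes the generated parser looks the enum up as a class attribute, as it does for the other nested enums):

Code: [Select]
import enum
from cameca import Cameca

class XrayLine(enum.IntEnum):
    # hypothetical members; the real values live in the ksy enum
    ka = 1
    kb = 2

    def __str__(self):  # nicer labels than the ksy-constrained names allow
        return {'ka': 'Kα', 'kb': 'Kβ'}[self.name]

# swap the generated enum for ours before any file is parsed:
Cameca.XrayLine = XrayLine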
« Last Edit: June 02, 2021, 05:35:46 AM by sem-geologist »

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #4 on: January 31, 2022, 08:14:28 AM »
I had completely missed that there is this Method Development Tool software. It looks like the part of my software using this parser is a bit similar in capabilities. The difference is that my tool is an offline tool which works natively on Linux, Windows and Mac (not tested by me there; it could have HiDPI scaling problems), has instantaneous loading of datasets (currently PeakSight wdsDat only), requires no registration and no license agreement, and is a lot snappier. It has curve highlighting, a global alpha for curves (transparency for increased visibility), multiple WDS plotting windows, movable markers, and background modelling for single-background and two-background cases (linear and exponential). The only missing feature, which MDT has, is burning-in of the KLM markers (albeit you will see the preview version and hopefully love it). Behold ;D, here comes HussariX: https://github.com/sem-geologist/HussariX. If you have no PeakSight wdsDat data, you can try it out with a sample file from https://github.com/sem-geologist/peaksight-binary-data-examples.

* it can contain some bugs and missing features, but as a tool it can be nice to have; I don't imagine any new complicated work without it, it saves tons of time.
« Last Edit: January 31, 2022, 09:25:34 AM by John Donovan »

Nicholas Ritchie

  • Professor
  • ****
  • Posts: 161
    • NIST DTSA-II
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #5 on: January 31, 2022, 10:26:02 AM »
Your HussariX looks like it will be really useful.  It is a big win that you've been able to reverse engineer the Cameca file format.  Proprietary formats are a scourge for open science.
"Do what you can, with what you have, where you are"
  - Teddy Roosevelt

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #6 on: January 31, 2022, 03:14:42 PM »
Nicholas,
Thanks for the kind words. I just want to say that I am pretty sure Cameca was/is not using a proprietary format to hide stuff intentionally; rather, the format formed along the path of least resistance laid down by writing software with C#/.NET. I am also pretty sure the format was not even human-designed (this is not my first RE of a data structure, so I can recognize human design): it is simply the binary memory chunks of structs representing objects in the running program, copied over into the file (on save) and restored to recreate the objects (on open). That is the price of choosing to develop on a platform which hates (or hated) standards and interoperability (I mean µ$). Look what a mess the msxml implementation is, e.g. how it handles floating point. There was this internal dilemma whether XML should save floating point numbers with the internationally standardized dot or with the local representation (which is a dot in many countries, but some European countries write the decimal separator as a comma). That is not a problem until a list of floating point numbers (i.e. an array) needs to be saved inside XML, and msxml finds in the local settings that the decimal separator... is a comma, which is also the separator of elements in XML... yeah...
You can have lots of fun with Bruker spx (Esprit uses msxml to deal with XML), e.g. when someone in the lab decides to change the locale from US-en to some East European one... I could come up with a workaround in HyperSpy for bcf and spx, but it is such a waste of computing cycles, all because someone at MS was so short-sighted as to decide that XML would never be used for information exchange between computers (or that no one would ever change the locale from US to some other, differently arranged unit system).

So at the end of the day we see Bruker Esprit using open formats... (except bcf), but interoperability is really brittle, due to msxml, when moving data between computers with different locales. Cameca used a binary format, and that guarantees a file can be copied to another machine and opened without any problem or misinterpretation. I am actually happy that Cameca went that way; the format at this point looks somewhat logical (if one thinks like a machine), and at the end of the day I could successfully reverse engineer its structure. I dread to think how much worse it could have been, e.g. if Cameca had used a Jet database as the data format: there is no easy, feature-complete, OS-agnostic library for reading that, and it would be much harder, probably even impossible, to reverse engineer, as somehow it is tied to the underlying file system.

So I would not judge Cameca for binary formats at all. It is easy to throw in between the lines that "proprietary formats are a scourge" while we are dwelling on top of the towers of Python, Perl, Java, Julia, R... where we have unrestricted access to all kinds of good database engines, proper internationally standardized XML implementations, sane and performance-oriented open source compression and containerization engines (zip, HDF5)... and so many more right tools and formats for the right problems, for free. C#/.NET, particularly twenty years ago, had none of this; it was boiling in its own M$ world of COM, VBA, Jet DB... and some small proprietary solutions for a lot of the missing functionality. If I had to blame Cameca for something, I would blame them for leaving the neatly built Unix platform and moving to M$. But even there it is hard to blame them, knowing what was done to Sun/Solaris by Oracle. We know now that quite a few HPC Unixes survived, and that a new beast, Linux, now takes the lead in high performance scientific computing. Probably the old Unix code could have been recompiled for FreeBSD (I think that would be the closest Solaris replacement)... But twenty years ago it could well have looked like the days of Unix as such were over (and additionally there was that bizarre SCO affair, which to the commercial world sent the message Unix = trouble), and M$ was taking nearly everything, at least in the PC sector. Thus I don't want to blame Cameca for anything.

And by the way, I am not looking to make a free portable replacement for SXResults. The outcome could look like, or drift in, that direction, but not intentionally. My intention is to extend functionality and have less restricted means to research the behaviour of the method. I could actually have gone a different way and worked with exported ASCII text files. I have experience in that, as my very first software was AWK scripts (what a remarkably fast language for text parsing) which parsed the textual output of the SX100 Solaris PeakSight (?, I can't remember what it was called there) and pushed it into a PostgreSQL database for further treatment. (Remember kids, during PhD studies you should look for the most bizarre ways to move, transform, slice and dice your data from one table to another, from one format to another; everything is legit which occupies you and your computer and keeps you away from real data analysis... and from writing the thesis.) However, I went berserk on reverse engineering in some chapter of my life...

I don't feel heroic for RE'ing these formats. After reverse engineering Bruker's bcf format, I felt sad that it was already working (which meant I would need to sit down and write that thesis), and looked for the next thing to RE. I probably would have RE'd some toaster, but those PeakSight binary files got in my way (PeakSight was missing export to ASCII at that moment, and I had got so frustrated with being thrown into that mouse clickery from my comfy Unix-awk-postgresql world), so I started, and it got challenging. And I don't like to give up. (Actually I can think of some demotivating moments, e.g. when I wrote and defended my thesis and no longer needed to avoid writing it.)

So what can such a massive high-current wdsDat set show, and what can it answer?

1) I was trained to do two-sided backgrounds in most cases, and had been doing such analyses for a long time, as I was led to believe I should, and I could not find an answer in any book on whether, for a single-sided background, the slope stays the same between materials of different density, where we see that the background sits at different intensity levels.
ANSWER: despite the intensity levels moving, the slope stays the same, be it an MgO standard or ThO2, provided there are no spectral artefacts at the given peak position (on a blank) and at the single background position, and no absorption edges or strong absorption nearby. That made me embrace single-background measurements, which have lots of very important advantages; i.e. it is much easier to find a spot for a single background than for two or more backgrounds when there are lots of elements and the spectral space is densely packed with peaks. It is also easier to avoid crossing absorption edges between the background and peak measurements (in particular other elements' absorption edges). (See the formula sketch after point 2.)
2) It allows finding positions and doing some tricks like the self-interference correction technique with two backgrounds, or a single background (that is the creative out-of-book background measurement: it corrects the peak interference directly at measurement time, and no software/matrix-corrected interference calculation is needed for that single interference). It is helpful for preventing circular interference corrections (i.e. when dealing with REE).
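To spell out what "the slope stays the same" means in point 1 (my own notation, not a quote from any manual): with B_pk the continuum intensity at the peak position and B_off at the single background position,

Code: [Select]
% slope calibrated once on a blank (a material with no analyte peak):
S = \frac{B_{\mathrm{pk}}^{\mathrm{blank}}}{B_{\mathrm{off}}^{\mathrm{blank}}}
% then applied to any unknown, whatever its absolute background level:
B_{\mathrm{pk}}^{\mathrm{unknown}} \approx S \cdot B_{\mathrm{off}}^{\mathrm{unknown}}

so a slope determined once can be reused across materials whose background intensity levels differ, which is exactly what the massive wdsDat set lets one verify.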


 

sem-geologist

  • Professor
  • ****
  • Posts: 311
Re: the ultimate cameca binary data reverse engineering attempt
« Reply #7 on: February 09, 2022, 01:51:37 PM »
There is going to be a minor API change, but it is needed for more efficient loading of large data. Currently it is in a separate branch (https://github.com/sem-geologist/peaksight-binary-parser/tree/lazyable-data), but it will soon get merged into the master of the parser; HussariX has already adopted it. The idea is to parse all the small metadata from the whole file, but take note of the absolute offsets and sizes of large chunks of continuous binary data (such as WDS wavescans or image frames). The idea is hardly achievable using Kaitai Struct alone, as it is designed to read all bytes into memory at least once, and there is no possibility to skip even a byte. But when types are used for data, Kaitai creates a class, and classes in modern OOP languages can be overridden with modified versions; in Python that is very easy. So the modification required on the Kaitai Struct side is that such a binary lump of data is not of an atomic type (binary string) but of a more complex type with its own attributes, which forces Kaitai to create a class for it. Using Kaitai parameters, the type is provided with the global offset and size, and so a lazy method can be created for reading that data (lazy_bytes). However, in Kaitai Struct, as I said, everything needs to be read at least once to move the cursor (for organized files with an embedded table of contents that is not required and out-of-order parsing is possible, but in our case the file needs to be parsed byte by byte). So for that a dummy seq "parsed_bytes" is defined in the ksy under the lazy_data type, which, when translated to the target language, is placed in the `_read` method, where the given size is read and assigned as a byte string. Overriding the `_read` method and forcing it to seek past the large chunk makes it possible to skip parsing the array data and allows for memory savings.

That may not sound so exciting, and for WDS wavescans it probably brings little memory saving, as wavescans are not memory hungry. However, impDat images, and in particular those acquired with the mapping workflow, are huge, as they contain frames or tiles. With frames they are particularly fat files: even if there is only a single frame (accumulate n), it will be saved as two chunks, the last one for that frame and the first chunk as the average or sum (depending on what was set in the GUI). That first chunk is what we would want to load in most cases, and frames would be nice to have on demand. Such lazy loading will save a lot of memory; loading even a 3 GB impDat should then take only a fraction of that memory.

But as I said, the Kaitai Struct implementation is not enough until it is modified in the target language to not read the data. A Python example of how to do that can be found in the HussariX cameca.py wrapper (and sanitizer) (https://github.com/sem-geologist/HussariX/blob/master/lib/parsers/cameca.py#L95-L115). As you can notice in the code, alongside overriding the "_read" method (it is called during __init__, which is inherited), "lazy_bytes" is also reimplemented. Kaitai Struct closes the file after the complete parsing if it is invoked in Python through "with open" (which is the very right thing to do), and thus the _root._io registered on the instance of that class is already closed at the moment of lazy reading; that reimplementation therefore opens the file and directly reads the particular data chunk. Whether such a reimplementation is needed probably depends on the target language and its file handling.

Update: actually I gave some thought to the naming scheme, and lazy_data will have a shorter attribute (or getters/setters) named `bytes`: less typing, and whether the reading is lazy or not depends on the proper modification of that class/type in the target language. So lazy or not, it will be data.bytes; the only difference is that data will also contain data.offset and data.size, so that a proper implementation overriding data.bytes with a lazy reader can be very easily achieved, and the API will be the same regardless of the way those bytes are read.
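A condensed Python sketch of the idea (simplified from the HussariX wrapper linked above; the class name follows from the ksy type lazy_data, while the file_path attribute on the root object is my own addition, as Kaitai does not keep the path around):

Code: [Select]
from cameca import Cameca

class LazyData(Cameca.LazyData):
    """Skip the big chunk at parse time; read it from disk on demand."""

    def _read(self):
        # offset and size arrive as kaitai parameters, so instead of
        # reading `size` bytes here, just move the cursor past the chunk
        self._io.seek(self._io.pos() + self.size)

    @property
    def bytes(self):
        # _root._io is already closed after parsing, so reopen the file;
        # `file_path` is assumed to be stashed on the root object beforehand
        with open(self._root.file_path, 'rb') as fh:
            fh.seek(self.offset)
            return fh.read(self.size)

# make the generated parser instantiate the lazy subclass:
Cameca.LazyData = LazyData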

P.S. Some time ago, while opening some 2 GB impDat file in PeakSight, I looked at the RAM meter and clearly saw that it loads the dataset data on demand and uses much less RAM than the whole file. I was quite sure I could never achieve anything similar due to the dynamic character of these files. I did not expect that it would be quite this easy to achieve.
« Last Edit: February 11, 2022, 04:37:07 AM by sem-geologist »