Feature/ntfs by yoavhhh · Pull Request #2253 · apache/tika

yoavhhh · 2025-06-17T06:57:45Z

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

- there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
- the issue ID (TIKA-XXXX)
  - is referenced in the title of the pull request
  - and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
- commits are squashed into a single one (or a few commits for larger changes)
- Tika is successfully built and unit tests pass by running mvn clean test
- there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
- if you add a new module that downstream users will depend upon, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

tballison · 2025-07-09T14:48:10Z

Thank you for opening this. I'm not sure that it fits within the current design goals of Tika, but there may be ways forward.

I deeply respect sleuthkit and would be interested in pursuing whatever we can to work together.

I also may misunderstand your use case and this PR. Please bear with me.

My major concern is that Tika is intended to process individual files one at a time. Even with a single large docx or PDF, Tika can go out of memory.

If we treat an entire filesystem as a file (obv with embedded files), I think we're aiming for serious problems.

There are two ways I could see some kind of integration point with Tika.

- Create a pipesiterator and fetchers so that Tika could iterate through ntfs or any other format handled by sleuthkit.
- Create a standardized "Unpackaging" API in Tika that would use sleuthkit commandline(s?) to extract binary files for further processing. There are lots of use cases I've seen where "unpackaging" is required rather than the usual parsing. This is typically a pre-parsing step required to unpackage a bundle of files that someone packaged for transfer. For example, this can be useful with zips, PSTs, mbox, etc.
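
To make the "unpackaging" idea a bit more concrete, here is a minimal sketch of what such a contract might look like. Everything in it is an assumption for illustration: `Unpackager`, `EntryHandler`, and `ZipUnpackager` are hypothetical names, not existing Tika or sleuthkit APIs, and a zip stream stands in for a disk image because the JDK can read it without native code. The point is only that entries are handed over one at a time, so the container is never treated as one giant file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class UnpackagingSketch {

    /** Hypothetical callback: receives one extracted entry at a time. */
    interface EntryHandler {
        void handle(String path, InputStream content) throws IOException;
    }

    /** Hypothetical unpackaging contract; not an existing Tika interface. */
    interface Unpackager {
        void unpackage(InputStream container, EntryHandler handler) throws IOException;
    }

    /** Zip-backed stand-in; a sleuthkit-backed version would walk an image instead. */
    static class ZipUnpackager implements Unpackager {
        @Override
        public void unpackage(InputStream container, EntryHandler handler) throws IOException {
            try (ZipInputStream zip = new ZipInputStream(container)) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    if (!entry.isDirectory()) {
                        // The handler only sees the bytes of the current entry.
                        handler.handle(entry.getName(), zip);
                    }
                }
            }
        }
    }

    /** Builds a small in-memory zip and returns "path:content" for each handled entry. */
    static List<String> demo() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("dir/b.txt"));
            out.write("world".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        List<String> seen = new ArrayList<>();
        new ZipUnpackager().unpackage(
                new ByteArrayInputStream(bytes.toByteArray()),
                (path, content) -> seen.add(
                        path + ":" + new String(content.readAllBytes(), StandardCharsets.UTF_8)));
        return seen;
    }

    public static void main(String[] args) throws IOException {
        demo().forEach(System.out::println);
    }
}
```

In a real integration the handler would presumably write each entry to a temp file or hand it to a fetcher for normal per-file parsing, rather than reading it fully into memory as the demo does.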

tballison · 2025-07-09T14:51:27Z

The above is all high-level. At a lower level, I'm concerned about platform-dependent binary code in Tika. We definitely have it in tika-parsers-extended (sqlite3) and in tika-parsers-ml. Another way to handle that is to require users to install the binaries on their system (or in Docker) first, as we do with tesseract (and why we opted not to integrate with tess4j).

holyshitt added 7 commits June 16, 2025 23:17

Added: Ntfs detector & parser

12d6422

Moved: NTFS Parser to extended parsers

7d1e87d

Added: SleuthKit libs (TO BE PUT IN SAFER SPOT LATER)

b7560fc

Added: test img

86e9b46

Added: tsk jars

85e6cd7

Added: 4.13.0 sleuthkit jars and lib files (compiled in java17)

eca4b9b

Added: 4.10.1 sleuthkit jars and lib files (compiled in java17)

f321564

yoavhhh wants to merge 7 commits into apache:main from holyshitt:feature/ntfs


3 participants
