
document_urls_extraction #2425

Merged
mlodic merged 56 commits into develop from document_urls_extraction on Sep 17, 2024
@federicofantini (Contributor) commented Jul 17, 2024

Description

Added URL extraction and pivoting for each document-downloader file type.
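
For readers unfamiliar with the idea, URL extraction from a file can be sketched roughly as below. This is only an illustrative sketch, not the code in this PR: the `extract_urls` helper and the regular expression are hypothetical, and the real analyzers use file-type-specific parsing. It scans raw bytes for `http(s)` URLs and returns each one once, in order of first appearance.

```python
import re
from typing import List

# Hypothetical, deliberately simple pattern: a scheme followed by everything
# up to whitespace, a quote, an angle bracket, or a NUL byte.
URL_PATTERN = re.compile(rb"https?://[^\s\"'<>\x00]+", re.IGNORECASE)


def extract_urls(data: bytes) -> List[str]:
    """Return the unique URLs found in raw file bytes, in order of first appearance."""
    seen = {}
    for match in URL_PATTERN.finditer(data):
        # Drop trailing punctuation that usually belongs to the surrounding text.
        url = match.group(0).rstrip(b".,;)").decode("utf-8", errors="replace")
        seen.setdefault(url, None)  # dict preserves insertion order
    return list(seen)
```

Each extracted URL could then be submitted as a new observable for further analysis, which is the "pivoting" part of the description.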

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).

Checklist

  • I have read and understood the rules about how to Contribute to this project
  • The pull request is for the branch develop
  • A new plugin (analyzer, connector, visualizer, playbook, pivot or ingestor) was added or changed, in which case:
    • I strictly followed the documentation "How to create a Plugin"
    • Usage file was updated.
    • Advanced-Usage was updated (in case the plugin provides additional optional configuration).
    • I have dumped the configuration from Django Admin using the dumpplugin command and added it in the project as a data migration. ("How to share a plugin with the community")
    • If a File analyzer was added and it supports a mimetype which is not already supported, you added a sample of that type inside the archive test_files.zip and you added the default tests for that mimetype in test_classes.py.
    • If you created a new analyzer and it is free (does not require any API key), please add it in the FREE_TO_USE_ANALYZERS playbook by following this guide.
    • Check if it could make sense to add that analyzer/connector to other freely available playbooks.
    • I have provided the resulting raw JSON of a finished analysis and a screenshot of the results.
    • If the plugin interacts with an external service, I have created an attribute called precisely url that contains this information. This is required for Health Checks.
    • If the plugin requires mocked testing, _monkeypatch() was used in its class to apply the necessary decorators.
    • I have added that raw JSON sample to the MockUpResponse of the _monkeypatch() method. This serves us to provide a valid sample for testing.
  • If external libraries/packages with restrictive licenses were used, they were added in the Legal Notice section.
  • Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved (see tests folder). All the tests (new and old ones) gave 0 errors.
  • If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
  • If the GUI has been modified:
    • I have provided a screenshot of the result in the PR.
    • I have created new frontend tests for the new component or updated existing ones.
  • After submitting the PR, if DeepSource, Django Doctors or other third-party linters triggered any alerts during the CI checks, I have solved those alerts.

@federicofantini federicofantini marked this pull request as draft July 17, 2024 16:18
@mlodic (Member) left a comment

Great changes! It would be cool if you could add a sample of a successful extraction for each case you implemented, as a test/confirmation that they work as expected.

gitguardian bot commented Jul 31, 2024

✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


@federicofantini federicofantini force-pushed the document_urls_extraction branch from aa8cc25 to 7a977ec Compare August 14, 2024 12:16
code-review-doctor[bot]

This comment was marked as outdated.

@federicofantini federicofantini force-pushed the document_urls_extraction branch from 51b8ce1 to 477638c Compare August 20, 2024 06:50
@mlodic (Member) left a comment

Before merging, you need to rebuild test_files.zip, because these 2 PRs will be merged before yours: #2461 and https://github.com/intelowlproject/IntelOwl/pull/2454/files

@federicofantini federicofantini requested a review from mlodic August 28, 2024 14:17
@mlodic mlodic merged commit 6e37a2b into develop Sep 17, 2024
@federicofantini federicofantini deleted the document_urls_extraction branch September 26, 2024 10:17
Michalsus pushed a commit to standa4/IntelOwl that referenced this pull request Oct 11, 2024
* document_urls_extraction

* added tests and samples

* fixed follina regex

* fixed string delimiter in lnk files url extraction

* fixed onenone test

* fixed strings_info tests max chars

* fixed migration number

* added all test files

* fixed XML attacks vuln

* fixed deepsource warnings

* removed subfolder zip test files

* added missing supported filetypes Lnk_Info migration

* added migration to lnk mimetype support

* wrong tests names

* fixed test_files path in CI

* added --malware_tools_analyzers in CI for tests

* removed CI malware analyzers because all_analyzers is present

* added temporary --malware_tools_analyzers

* added all analyzers just to make the tests pass

* added all analyzers just to make the tests pass

* added boxjs tests

* disabled mockup connetions for boxjs and strings_info

* fixed malware_analyzer_tools filename thug

* fixed deepsource missing method

* Manage missing directory

Signed-off-by: 0ssigeno <[email protected]>

* added missing pr contribution requirements

* fixed typo

* fixed typo

* added test for iocextract analyzer

* fixed pdf info without uris

* fixed assertTrue member in list with assertIn

* removed useless onenote playbook

* added playbook uris extraction

* Update pull_request_automation.yml

* changed playbook to execute

* reformatted

* changed playbook to execute

* added checks to load file data type

* updated test files

* made requested changes

* fixed migration order

* fixed migration order

* fixed migration order

* adjusted migrations, doc_info and others

* linter

* added conditional testing

* added testif

* added test for lnk file

* trying adjusting CI

* removed duplicated test

* prevent test from failing when skipping unhealthy containers tests

---------

Signed-off-by: 0ssigeno <[email protected]>
Co-authored-by: 0ssigeno <[email protected]>
Co-authored-by: Matteo Lodi <[email protected]>
vaclavbartos pushed a commit to standa4/IntelOwl that referenced this pull request Oct 13, 2024