nltk-data: make searchable, add all downloadables #409482

Merged
happysalada merged 8 commits into NixOS:master from
bengsparks:nltk-data
May 22, 2025
@bengsparks
Contributor

@bengsparks bengsparks commented May 21, 2025

Searching for nltk-data on NixOS search returns no results.
Applying a fix similar to the one in #402602 should resolve this.

Additional QoL fixes: patching is skipped where there is nothing to patch, and every downloadable in the nltk_data repository is now added to the attribute set.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • Nixpkgs 25.11 Release Notes (or backporting 24.11 and 25.05 Nixpkgs Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
  • NixOS 25.11 Release Notes (or backporting 24.11 and 25.05 NixOS Release notes)
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@bengsparks bengsparks marked this pull request as draft May 21, 2025 17:07
@github-actions github-actions bot added 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels May 21, 2025
There are no scripts in the `package` folder, only datasets, models, etc.
nltk-data is an attribute set, which leads to nixos search omitting it.
Wrapping it in `recurseIntoAttrs` remedies this.
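
As a rough illustration of the commit message above: in nixpkgs, `recurseIntoAttrs` amounts to setting `recurseForDerivations = true` on an attribute set, which is the marker package-listing tools check before descending into a nested set. The sketch below is self-contained and does not reproduce the real nltk-data expression; `nltkData` and its placeholder values are hypothetical.

```nix
let
  # Same effect as the nixpkgs helper of this name: tag the set so that
  # package-listing tools recurse into it instead of skipping it.
  recurseIntoAttrs = attrs: attrs // { recurseForDerivations = true; };

  # Hypothetical stand-in for the real attribute set of datasets.
  nltkData = {
    punkt = "placeholder";
    stopwords = "placeholder";
  };
in
{
  # The resulting set keeps its original attributes and gains the marker.
  nltk-data = recurseIntoAttrs nltkData;
}
```

The resulting `nltk-data` set carries `recurseForDerivations = true` alongside `punkt` and `stopwords`, so a tool that honours the marker would list the nested entries as `nltk-data.punkt` and `nltk-data.stopwords`.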
@github-actions github-actions bot added 10.rebuild-darwin: 101-500 This PR causes between 101 and 500 packages to rebuild on Darwin. 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux. labels May 21, 2025
@bengsparks
Contributor Author

nixpkgs-review result

Generated using nixpkgs-review.

Command: nixpkgs-review pr 409482


x86_64-linux

⏩ 2 packages blacklisted:
  • nixos-install-tools
  • tests.nixos-functions.nixos-test
✅ 206 packages built:
  • aider-chat-full
  • aider-chat-full.dist
  • aider-chat-with-help
  • aider-chat-with-help.dist
  • mealie
  • mealie.dist
  • nltk-data.abc
  • nltk-data.alpino
  • nltk-data.averaged-perceptron-tagger
  • nltk-data.averaged-perceptron-tagger-eng
  • nltk-data.averaged-perceptron-tagger-ru
  • nltk-data.averaged-perceptron-tagger-rus
  • nltk-data.basque-grammars
  • nltk-data.bcp47
  • nltk-data.biocreative-ppi
  • nltk-data.bllip-wsj-no-aux
  • nltk-data.book-grammars
  • nltk-data.brown
  • nltk-data.brown-tei
  • nltk-data.cess-cat
  • nltk-data.cess-esp
  • nltk-data.chat80
  • nltk-data.city-database
  • nltk-data.cmudict
  • nltk-data.comparative-sentences
  • nltk-data.comtrans
  • nltk-data.conll2000
  • nltk-data.conll2002
  • nltk-data.conll2007
  • nltk-data.crubadan
  • nltk-data.dependency-treebank
  • nltk-data.dolch
  • nltk-data.europarl-raw
  • nltk-data.extended-omw
  • nltk-data.floresta
  • nltk-data.framenet-v15
  • nltk-data.framenet-v17
  • nltk-data.gazetteers
  • nltk-data.genesis
  • nltk-data.gutenberg
  • nltk-data.ieer
  • nltk-data.inaugural
  • nltk-data.indian
  • nltk-data.jeita
  • nltk-data.kimmo
  • nltk-data.knbc
  • nltk-data.large-grammars
  • nltk-data.lin-thesaurus
  • nltk-data.mac-morpho
  • nltk-data.machado
  • nltk-data.masc-tagged
  • nltk-data.maxent-ne-chunker
  • nltk-data.maxent-ne-chunker-tab
  • nltk-data.maxent-treebank-pos-tagger
  • nltk-data.maxent-treebank-pos-tagger-tab
  • nltk-data.moses-sample
  • nltk-data.movie-reviews
  • nltk-data.mte-teip5
  • nltk-data.mwa-ppdb
  • nltk-data.names
  • nltk-data.nombank-1-0
  • nltk-data.nonbreaking-prefixes
  • nltk-data.nps-chat
  • nltk-data.omw
  • nltk-data.omw-1-4
  • nltk-data.opinion-lexicon
  • nltk-data.panlex-swadesh
  • nltk-data.paradigms
  • nltk-data.pe08
  • nltk-data.perluniprops
  • nltk-data.pil
  • nltk-data.pl196x
  • nltk-data.porter-test
  • nltk-data.ppattach
  • nltk-data.problem-reports
  • nltk-data.product-reviews-1
  • nltk-data.product-reviews-2
  • nltk-data.propbank
  • nltk-data.pros-cons
  • nltk-data.ptb
  • nltk-data.punkt
  • nltk-data.punkt-tab
  • nltk-data.qc
  • nltk-data.reuters
  • nltk-data.rslp
  • nltk-data.rte
  • nltk-data.sample-grammars
  • nltk-data.semcor
  • nltk-data.senseval
  • nltk-data.sentence-polarity
  • nltk-data.sentiwordnet
  • nltk-data.shakespeare
  • nltk-data.sinica-treebank
  • nltk-data.smultron
  • nltk-data.snowball-data
  • nltk-data.spanish-grammars
  • nltk-data.state-union
  • nltk-data.stopwords
  • nltk-data.subjectivity
  • nltk-data.swadesh
  • nltk-data.switchboard
  • nltk-data.tagsets-json
  • nltk-data.timit
  • nltk-data.toolbox
  • nltk-data.treebank
  • nltk-data.twitter-samples
  • nltk-data.udhr
  • nltk-data.udhr2
  • nltk-data.unicode-samples
  • nltk-data.universal-tagset
  • nltk-data.universal-treebanks-v20
  • nltk-data.verbnet
  • nltk-data.verbnet3
  • nltk-data.webtext
  • nltk-data.wmt15-eval
  • nltk-data.word2vec-sample
  • nltk-data.wordnet
  • nltk-data.wordnet-ic
  • nltk-data.wordnet2021
  • nltk-data.wordnet2022
  • nltk-data.wordnet31
  • nltk-data.words
  • nltk-data.ycoe
  • private-gpt
  • private-gpt.dist
  • python312Packages.dataprep-ml
  • python312Packages.dataprep-ml.dist
  • python312Packages.ingredient-parser-nlp
  • python312Packages.ingredient-parser-nlp.dist
  • python312Packages.llama-cloud-services
  • python312Packages.llama-cloud-services.dist
  • python312Packages.llama-index
  • python312Packages.llama-index-agent-openai
  • python312Packages.llama-index-agent-openai.dist
  • python312Packages.llama-index-cli
  • python312Packages.llama-index-cli.dist
  • python312Packages.llama-index-core
  • python312Packages.llama-index-core.dist
  • python312Packages.llama-index-embeddings-gemini
  • python312Packages.llama-index-embeddings-gemini.dist
  • python312Packages.llama-index-embeddings-google
  • python312Packages.llama-index-embeddings-google.dist
  • python312Packages.llama-index-embeddings-huggingface
  • python312Packages.llama-index-embeddings-huggingface.dist
  • python312Packages.llama-index-embeddings-ollama
  • python312Packages.llama-index-embeddings-ollama.dist
  • python312Packages.llama-index-embeddings-openai
  • python312Packages.llama-index-embeddings-openai.dist
  • python312Packages.llama-index-graph-stores-nebula
  • python312Packages.llama-index-graph-stores-nebula.dist
  • python312Packages.llama-index-graph-stores-neo4j
  • python312Packages.llama-index-graph-stores-neo4j.dist
  • python312Packages.llama-index-graph-stores-neptune
  • python312Packages.llama-index-graph-stores-neptune.dist
  • python312Packages.llama-index-indices-managed-llama-cloud
  • python312Packages.llama-index-indices-managed-llama-cloud.dist
  • python312Packages.llama-index-legacy
  • python312Packages.llama-index-legacy.dist
  • python312Packages.llama-index-llms-ollama
  • python312Packages.llama-index-llms-ollama.dist
  • python312Packages.llama-index-llms-openai
  • python312Packages.llama-index-llms-openai-like
  • python312Packages.llama-index-llms-openai-like.dist
  • python312Packages.llama-index-llms-openai.dist
  • python312Packages.llama-index-multi-modal-llms-openai
  • python312Packages.llama-index-multi-modal-llms-openai.dist
  • python312Packages.llama-index-program-openai
  • python312Packages.llama-index-program-openai.dist
  • python312Packages.llama-index-question-gen-openai
  • python312Packages.llama-index-question-gen-openai.dist
  • python312Packages.llama-index-readers-database
  • python312Packages.llama-index-readers-database.dist
  • python312Packages.llama-index-readers-file
  • python312Packages.llama-index-readers-file.dist
  • python312Packages.llama-index-readers-json
  • python312Packages.llama-index-readers-json.dist
  • python312Packages.llama-index-readers-llama-parse
  • python312Packages.llama-index-readers-llama-parse.dist
  • python312Packages.llama-index-readers-s3
  • python312Packages.llama-index-readers-s3.dist
  • python312Packages.llama-index-readers-twitter
  • python312Packages.llama-index-readers-twitter.dist
  • python312Packages.llama-index-readers-txtai
  • python312Packages.llama-index-readers-txtai.dist
  • python312Packages.llama-index-readers-weather
  • python312Packages.llama-index-readers-weather.dist
  • python312Packages.llama-index-vector-stores-chroma
  • python312Packages.llama-index-vector-stores-chroma.dist
  • python312Packages.llama-index-vector-stores-google
  • python312Packages.llama-index-vector-stores-google.dist
  • python312Packages.llama-index-vector-stores-postgres
  • python312Packages.llama-index-vector-stores-postgres.dist
  • python312Packages.llama-index-vector-stores-qdrant
  • python312Packages.llama-index-vector-stores-qdrant.dist
  • python312Packages.llama-index.dist
  • python312Packages.llama-parse
  • python312Packages.llama-parse.dist
  • python312Packages.mindsdb-evaluator
  • python312Packages.mindsdb-evaluator.dist
  • python312Packages.private-gpt
  • python312Packages.private-gpt.dist
  • python312Packages.type-infer
  • python312Packages.type-infer.dist
  • python313Packages.ingredient-parser-nlp
  • python313Packages.ingredient-parser-nlp.dist
  • unstructured-api

aarch64-linux

⏩ 10 packages marked as broken and skipped:
  • private-gpt
  • private-gpt.dist
  • python312Packages.llama-index
  • python312Packages.llama-index-cli
  • python312Packages.llama-index-cli.dist
  • python312Packages.llama-index-vector-stores-chroma
  • python312Packages.llama-index-vector-stores-chroma.dist
  • python312Packages.llama-index.dist
  • python312Packages.private-gpt
  • python312Packages.private-gpt.dist
⏩ 2 packages blacklisted:
  • nixos-install-tools
  • tests.nixos-functions.nixos-test
✅ 195 packages built:
  • aider-chat-full
  • aider-chat-full.dist
  • aider-chat-with-help
  • aider-chat-with-help.dist
  • mealie
  • mealie.dist
  • nltk-data.abc
  • nltk-data.alpino
  • nltk-data.averaged-perceptron-tagger
  • nltk-data.averaged-perceptron-tagger-eng
  • nltk-data.averaged-perceptron-tagger-ru
  • nltk-data.averaged-perceptron-tagger-rus
  • nltk-data.basque-grammars
  • nltk-data.bcp47
  • nltk-data.biocreative-ppi
  • nltk-data.bllip-wsj-no-aux
  • nltk-data.book-grammars
  • nltk-data.brown
  • nltk-data.brown-tei
  • nltk-data.cess-cat
  • nltk-data.cess-esp
  • nltk-data.chat80
  • nltk-data.city-database
  • nltk-data.cmudict
  • nltk-data.comparative-sentences
  • nltk-data.comtrans
  • nltk-data.conll2000
  • nltk-data.conll2002
  • nltk-data.conll2007
  • nltk-data.crubadan
  • nltk-data.dependency-treebank
  • nltk-data.dolch
  • nltk-data.europarl-raw
  • nltk-data.extended-omw
  • nltk-data.floresta
  • nltk-data.framenet-v15
  • nltk-data.framenet-v17
  • nltk-data.gazetteers
  • nltk-data.genesis
  • nltk-data.gutenberg
  • nltk-data.ieer
  • nltk-data.inaugural
  • nltk-data.indian
  • nltk-data.jeita
  • nltk-data.kimmo
  • nltk-data.knbc
  • nltk-data.large-grammars
  • nltk-data.lin-thesaurus
  • nltk-data.mac-morpho
  • nltk-data.machado
  • nltk-data.masc-tagged
  • nltk-data.maxent-ne-chunker
  • nltk-data.maxent-ne-chunker-tab
  • nltk-data.maxent-treebank-pos-tagger
  • nltk-data.maxent-treebank-pos-tagger-tab
  • nltk-data.moses-sample
  • nltk-data.movie-reviews
  • nltk-data.mte-teip5
  • nltk-data.mwa-ppdb
  • nltk-data.names
  • nltk-data.nombank-1-0
  • nltk-data.nonbreaking-prefixes
  • nltk-data.nps-chat
  • nltk-data.omw
  • nltk-data.omw-1-4
  • nltk-data.opinion-lexicon
  • nltk-data.panlex-swadesh
  • nltk-data.paradigms
  • nltk-data.pe08
  • nltk-data.perluniprops
  • nltk-data.pil
  • nltk-data.pl196x
  • nltk-data.porter-test
  • nltk-data.ppattach
  • nltk-data.problem-reports
  • nltk-data.product-reviews-1
  • nltk-data.product-reviews-2
  • nltk-data.propbank
  • nltk-data.pros-cons
  • nltk-data.ptb
  • nltk-data.punkt
  • nltk-data.punkt-tab
  • nltk-data.qc
  • nltk-data.reuters
  • nltk-data.rslp
  • nltk-data.rte
  • nltk-data.sample-grammars
  • nltk-data.semcor
  • nltk-data.senseval
  • nltk-data.sentence-polarity
  • nltk-data.sentiwordnet
  • nltk-data.shakespeare
  • nltk-data.sinica-treebank
  • nltk-data.smultron
  • nltk-data.snowball-data
  • nltk-data.spanish-grammars
  • nltk-data.state-union
  • nltk-data.stopwords
  • nltk-data.subjectivity
  • nltk-data.swadesh
  • nltk-data.switchboard
  • nltk-data.tagsets-json
  • nltk-data.timit
  • nltk-data.toolbox
  • nltk-data.treebank
  • nltk-data.twitter-samples
  • nltk-data.udhr
  • nltk-data.udhr2
  • nltk-data.unicode-samples
  • nltk-data.universal-tagset
  • nltk-data.universal-treebanks-v20
  • nltk-data.verbnet
  • nltk-data.verbnet3
  • nltk-data.webtext
  • nltk-data.wmt15-eval
  • nltk-data.word2vec-sample
  • nltk-data.wordnet
  • nltk-data.wordnet-ic
  • nltk-data.wordnet2021
  • nltk-data.wordnet2022
  • nltk-data.wordnet31
  • nltk-data.words
  • nltk-data.ycoe
  • python312Packages.dataprep-ml
  • python312Packages.dataprep-ml.dist
  • python312Packages.ingredient-parser-nlp
  • python312Packages.ingredient-parser-nlp.dist
  • python312Packages.llama-cloud-services
  • python312Packages.llama-cloud-services.dist
  • python312Packages.llama-index-agent-openai
  • python312Packages.llama-index-agent-openai.dist
  • python312Packages.llama-index-core
  • python312Packages.llama-index-core.dist
  • python312Packages.llama-index-embeddings-gemini
  • python312Packages.llama-index-embeddings-gemini.dist
  • python312Packages.llama-index-embeddings-google
  • python312Packages.llama-index-embeddings-google.dist
  • python312Packages.llama-index-embeddings-huggingface
  • python312Packages.llama-index-embeddings-huggingface.dist
  • python312Packages.llama-index-embeddings-ollama
  • python312Packages.llama-index-embeddings-ollama.dist
  • python312Packages.llama-index-embeddings-openai
  • python312Packages.llama-index-embeddings-openai.dist
  • python312Packages.llama-index-graph-stores-nebula
  • python312Packages.llama-index-graph-stores-nebula.dist
  • python312Packages.llama-index-graph-stores-neo4j
  • python312Packages.llama-index-graph-stores-neo4j.dist
  • python312Packages.llama-index-graph-stores-neptune
  • python312Packages.llama-index-graph-stores-neptune.dist
  • python312Packages.llama-index-indices-managed-llama-cloud
  • python312Packages.llama-index-indices-managed-llama-cloud.dist
  • python312Packages.llama-index-legacy
  • python312Packages.llama-index-legacy.dist
  • python312Packages.llama-index-llms-ollama
  • python312Packages.llama-index-llms-ollama.dist
  • python312Packages.llama-index-llms-openai
  • python312Packages.llama-index-llms-openai-like
  • python312Packages.llama-index-llms-openai-like.dist
  • python312Packages.llama-index-llms-openai.dist
  • python312Packages.llama-index-multi-modal-llms-openai
  • python312Packages.llama-index-multi-modal-llms-openai.dist
  • python312Packages.llama-index-program-openai
  • python312Packages.llama-index-program-openai.dist
  • python312Packages.llama-index-question-gen-openai
  • python312Packages.llama-index-question-gen-openai.dist
  • python312Packages.llama-index-readers-database
  • python312Packages.llama-index-readers-database.dist
  • python312Packages.llama-index-readers-file
  • python312Packages.llama-index-readers-file.dist
  • python312Packages.llama-index-readers-json
  • python312Packages.llama-index-readers-json.dist
  • python312Packages.llama-index-readers-llama-parse
  • python312Packages.llama-index-readers-llama-parse.dist
  • python312Packages.llama-index-readers-s3
  • python312Packages.llama-index-readers-s3.dist
  • python312Packages.llama-index-readers-twitter
  • python312Packages.llama-index-readers-twitter.dist
  • python312Packages.llama-index-readers-txtai
  • python312Packages.llama-index-readers-txtai.dist
  • python312Packages.llama-index-readers-weather
  • python312Packages.llama-index-readers-weather.dist
  • python312Packages.llama-index-vector-stores-google
  • python312Packages.llama-index-vector-stores-google.dist
  • python312Packages.llama-index-vector-stores-postgres
  • python312Packages.llama-index-vector-stores-postgres.dist
  • python312Packages.llama-index-vector-stores-qdrant
  • python312Packages.llama-index-vector-stores-qdrant.dist
  • python312Packages.llama-parse
  • python312Packages.llama-parse.dist
  • python312Packages.mindsdb-evaluator
  • python312Packages.mindsdb-evaluator.dist
  • python312Packages.type-infer
  • python312Packages.type-infer.dist
  • python313Packages.ingredient-parser-nlp
  • python313Packages.ingredient-parser-nlp.dist

@bengsparks bengsparks marked this pull request as ready for review May 21, 2025 20:33
Contributor

@antonmosich antonmosich left a comment

Looks good to me. It feels a bit odd to have all the makeCorpus etc. calls done manually instead of using something map-ish, but it isn't obvious to me how to do that in a concise way.
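
One possible shape for the "map-ish" approach mentioned in this comment, as a sketch only: generate the attribute set from a list of names. `makeCorpus` below is a hypothetical stand-in that ignores hashes and source fetching, so it does not reflect the real builder's signature. `genAttrs` is written with builtins to keep the example self-contained; nixpkgs provides `lib.genAttrs` with the same argument order.

```nix
let
  # Hypothetical builder: the real makeCorpus takes different arguments.
  makeCorpus = name: {
    inherit name;
    kind = "corpus";
  };

  # Names taken from the review output; the list is shortened for illustration.
  corpusNames = [
    "brown"
    "gutenberg"
    "stopwords"
  ];

  # Build { <name> = f <name>; ... } from a list of names.
  genAttrs =
    names: f:
    builtins.listToAttrs (
      map (n: {
        name = n;
        value = f n;
      }) names
    );
in
genAttrs corpusNames makeCorpus
```

This yields one attribute per name, each built by the same function. If each downloadable needs its own hash, the list of names would have to become a list of records, which may be part of why a concise version is not obvious.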

@happysalada happysalada merged commit ef31402 into NixOS:master May 22, 2025
18 of 21 checks passed
@happysalada
Contributor

Nice improvement! And welcome as a maintainer!

@winterqt
Member

This broke alias eval; the revert is in #409843.
