
Docker & OpenTapioca

Open kerphi opened this issue 5 years ago • 11 comments

Would it be possible to have a Docker image to help with testing/deploying OpenTapioca? It would be a great feature to help newcomers get into the OpenTapioca world.

kerphi avatar Jan 19 '21 19:01 kerphi

It would be great to have that indeed! I am unlikely to find the time to work on this soon but would be very much in favour of including that in the repository.

wetneb avatar Jan 19 '21 20:01 wetneb
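For anyone picking this up: a minimal, untested sketch of what such an image could look like. The base image, Python version, exposed port, and entry-point module are all assumptions here, not things confirmed by the repository.

```dockerfile
# Hypothetical sketch only -- base image, port, and entry point are guesses.
FROM python:3.8-slim

WORKDIR /opentapioca
COPY . .
RUN pip install -r requirements.txt

# The CLI expects a settings.py; start from the template shipped in the repo.
RUN cp settings_template.py settings.py

# Solr (with its embedded ZooKeeper) is assumed to run in a separate
# container, reachable over the network, rather than inside this image.
EXPOSE 8457
CMD ["python", "-m", "opentapioca.app"]
```

A docker-compose file pairing this with an official Solr container would probably be the natural next step.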

@wetneb Hey, could you specify which version of ZooKeeper you are using and what your local setup looks like? Maybe it would be useful to have the series of steps for your specific ZooKeeper install procedure.

eracle avatar Aug 04 '22 15:08 eracle

I use Solr 7.7.3 and the Zookeeper that is bundled in it. I do not install Zookeeper itself, I just download Solr and that comes with Zookeeper in it.

wetneb avatar Aug 04 '22 16:08 wetneb
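The setup described above can be sketched as follows. The download URL follows the usual Apache archive layout; treat it as an assumption and verify it before use.

```shell
# Download Solr 7.7.3, which bundles its own ZooKeeper --
# no separate ZooKeeper install is needed.
wget https://archive.apache.org/dist/lucene/solr/7.7.3/solr-7.7.3.tgz
tar xzf solr-7.7.3.tgz

# Start Solr in SolrCloud mode; -c also starts the embedded ZooKeeper
# (by default on the Solr port + 1000, i.e. 9983).
solr-7.7.3/bin/solr start -c
```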

Solr versions before 8.11.1 are affected by the Log4Shell CVE (CVE-2021-44228), which is a security problem. Do you think your project would also work with Solr 8 and up?

eracle avatar Aug 05 '22 14:08 eracle

I have not checked. I am not actively maintaining this project as you can see. But I will always be happy to merge PRs.

wetneb avatar Aug 05 '22 14:08 wetneb

Ok, I more or less solved the previous problem. I will have a PR ready soon. One question: should I update the settings_template.py file?

# The name of the Solr collection where Wikidata is indexed
SOLR_COLLECTION = 'wd_2019-02-24'

# The path to the language model, trained with "tapioca train-bow"
LANGUAGE_MODEL_PATH='data/wd_2019-02-24.bow.pkl'
# The path to the pagerank Numpy vector, computed with "tapioca compute-pagerank"
PAGERANK_PATH='data/wd_2019-02-24.pgrank.npy'
# The path to the trained classifier, obtained from "tapioca train-classifier"
CLASSIFIER_PATH='data/rss_istex_classifier.pkl'

eracle avatar Aug 05 '22 16:08 eracle

I am not sure what you want to change in the settings_template.py, but I assume that if you want to change things there, you probably have a good reason to :)

wetneb avatar Aug 05 '22 16:08 wetneb

The CLI was asking me about a settings.py file, which is probably not covered in the docs. Should I copy the settings_template.py file and rename it to settings.py?

Another question: the following command:

tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

What's my_collection_name? Could you provide some examples of its value?

eracle avatar Aug 06 '22 23:08 eracle

The CLI was asking me about a settings.py file, which is probably not covered in the docs. Should I copy the settings_template.py file and rename it to settings.py?

Indeed! And feel free to have a look at its contents and check if there is anything there that you want to change for your own purposes.
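Concretely, the step above is just the following, run from the repository root:

```shell
# Create the settings module the CLI looks for, starting from the template;
# then edit settings.py to point at your own collection and model files.
cp settings_template.py settings.py
```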

Another question

The docs say:

Pick a Solr collection name (without creating the collection in advance) and run: tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json

So the intention behind this sentence is to say that:

  • you can come up with a collection name of your own and it can be arbitrary. For instance, bubble_tea_is_overrated could be a nice collection name, just like a_little_waltz_in_the_park would be a nice one too.
  • Once you have made up your mind about your collection name, you can insert it in the command mentioned in the docs. For instance: tapioca index-dump a_little_waltz_in_the_park latest-all.json.bz2 --profile profiles/human_organization_place.json

If you can think of ways to make the docs more understandable in both locations, do not hesitate to open a PR with the phrasing you would have preferred there; I am sure it will be much better.

wetneb avatar Aug 07 '22 07:08 wetneb

@wetneb Hi Antonin, I am testing the branch on my personal server and at the moment I am running the indexing. Unfortunately, Solr keeps getting killed by the operating system because it uses too much memory. It looks like some memory leak (or something similar) happens in SolrCloud during indexing. How much memory did you have on your server?

I also noticed there is a skip_docs parameter. Do you use it to manually restart the indexing process by passing the skip_docs number from the last failure?

eracle avatar Aug 18 '22 20:08 eracle

Hi @eracle,

On my previous server I had 20+ GB of RAM. Now I have much less, so I can no longer update the index.

Yes, I suspect skip_docs can be used to resume the indexing from an offset, but I do not remember exactly.

wetneb avatar Aug 19 '22 08:08 wetneb
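If skip_docs does work that way, resuming after a crash might look something like the following. The exact option spelling and whether it counts documents or something else are assumptions; check `tapioca index-dump --help` first.

```shell
# Hypothetical: resume indexing, skipping the entities that were already
# indexed before the crash (here, the first 2,000,000 documents).
# The --skip-docs flag name is a guess based on the skip_docs parameter.
tapioca index-dump my_collection_name latest-all.json.bz2 \
    --profile profiles/human_organization_place.json \
    --skip-docs 2000000
```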