* check sample submission & add package constraint
* add trace.log into clear
* change default
* simplify
* clear CI workspace before running
* move to CI
* use sudo to clean workspace
* move prepare out of global var
---------
Co-authored-by: Xu Yang <[email protected]>
rdagent/scenarios/data_science/proposal/exp_gen/prompts_v2.yaml (6 additions, 0 deletions)
@@ -306,6 +306,12 @@ task_gen:
 - When only one model is used, its score should be present, and an "ensemble" score (which would be the same as the single model's score in this case) must also be recorded.
 - Ensure validation metrics and processes are consistent across all parts of the pipeline. Avoid changes that would alter how validation metrics are calculated unless that is part of the hypothesis.
 8. **Submission File (`submission.csv`)**: Generate `submission.csv` in the **exact format** required (column names, order, data types), as detailed by `sample_submission.csv` in the `Competition Scenario Description`. This is a critical step.
+9. **Preferred Packages Notes**:
+- You can choose the most proper packages for the task to best achieve the hypothesis.
+- When facing a choice between two packages which both can achieve the same goal, you should choose the one which is more commonly used and less likely to cause bugs in coding. Especially those you are not familiar with.
+- For GBDT models, prefer XGBoost or RandomForest over LightGBM unless the SOTA or hypothesis dictates otherwise.
+- For neural networks, prefer PyTorch or PyTorch based library (over TensorFlow) unless the SOTA or hypothesis dictates otherwise.
+- For neural networks, prefer fine-tuning pre-trained models over training from scratch.
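The package preferences added above are prompt guidance for the generated pipelines rather than code in this PR. As a rough illustration of the GBDT preference (XGBoost over LightGBM), a baseline might look like the sketch below; the synthetic dataset, split, and hyperparameters are placeholder assumptions, not taken from the repository.

```python
# Illustrative sketch only: an XGBoost baseline, preferred over LightGBM per the
# guidance above. Data, split, and hyperparameters are placeholder assumptions.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_va, model.predict(X_va)))
```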
rdagent/scenarios/data_science/share.yaml (16 additions, 12 deletions)
@@ -264,44 +264,41 @@ component_spec:
 {% endraw %}
 
 Pipeline: |-
-0. Program Execution:
+1. Program Execution:
 - The workflow will be executed by running `python main.py` with no command-line arguments. Ensure that `main.py` does not require or expect any parameters.
 - The working directory will only contain `main.py`. Any additional files required for execution must be downloaded or generated by `main.py` itself.
 
-1. File Handling:
+2. File Handling:
 - Handle file encoding and delimiters appropriately.
 - Combine or process multiple files if necessary.
 - Avoid using the sample submission file to infer test indices. If a dedicated test index file is available, use that. If not, use the order in the test file as the test index.
 - Ensure you load the actual data from the files, not just the filenames or paths. Do not postpone data loading to later steps.
 
-2. Data Preprocessing:
+3. Data Preprocessing:
 - Convert data types correctly (e.g., numeric, categorical, date parsing).
 - Optimize memory usage for large datasets using techniques like downcasting or reading data in chunks if necessary.
 - Domain-Specific Handling:
 - Apply competition-specific preprocessing steps as needed (e.g., text tokenization, image resizing).
 
-3. Code Standards:
+4. Code Standards:
 - DO NOT use progress bars (e.g., `tqdm`).
 - DO NOT use the sample submission file to extract test index information.
 - DO NOT exclude features inadvertently during this process.
 
-4. NOTES
+5. NOTES
 - Never use sample submission as the test index, as it may not be the same as the test data. Use the test index file or test data source to get the test index.
-- For neural network models, use pytorch rather than tensorflow as the backend if possible.
-- For decision tree models, use xgboost or RandomForest rather than lightgbm as the backend if possible.
-- For neural network models, it's always better to firstly try from a pretrained model and then fine-tune it rather than training from scratch.
 
-5. General Considerations:
+6. General Considerations:
 - Ensure scalability for large datasets.
 - Handle missing values and outliers appropriately (e.g., impute, remove, or replace).
 - Ensure consistency between feature data types and transformations.
 - Prevent data leakage: Do not use information derived from the test set when transforming training data.
 - Sampling a subset of the training data for efficiency (e.g., randomly selecting a portion of the data) is discouraged unless it demonstrably improves performance (e.g., removing irrelevant or outlier samples).
 
-6. Notes:
+7. Notes:
 - GPU and multiprocessing are available and are encouraged to use for accelerating transformations.
 
-7. Metric Calculation and Storage:
+8. Metric Calculation and Storage:
 - Calculate the metric (mentioned in the evaluation section of the competition information) for each model and ensemble strategy on valid, and save the results in `scores.csv`
 - The evaluation should be based on k-fold cross-validation but only if that's an appropriate evaluation for the task at hand. Store the mean validation score of k-fold cross-validation in `scores.csv` on each model. Refer to the hyperparameter specification for rules to set the CV folds.
 - Even if only one model is present, compute the ensemble score and store it under `"ensemble"`.
@@ -311,9 +308,16 @@ component_spec:
 - <metric_name>: The calculated metric value for that model or ensemble strategy. The metric name can be found in the scenario description. The metric name should be exactly the same as the one in the scenario description since user will use it to check the result.
 - Validation metrics should be aligned across all ideas and implementations. Avoid proposing ideas that might affect the validation metrics and modifying the related code.
 
-8. Submission File:
+9. Submission File:
 - Save the final predictions as `submission.csv`, ensuring the format matches the competition requirements (refer to `sample_submission` in the Folder Description for the correct structure).
 - Present the required submission format explicitly and ensure the output adheres to it.
+
+10. Preferred Packages:
+- You can choose the most proper packages to achieve the task.
+- When facing a choice between two packages which both can achieve the same goal, you should choose the one which is more commonly used and less likely to cause bugs in coding. Especially those you are not familiar with.
+- For GBDT models, prefer XGBoost or RandomForest over LightGBM unless the SOTA or hypothesis dictates otherwise.
+- For neural networks, prefer PyTorch or PyTorch based library (over TensorFlow) unless the SOTA or hypothesis dictates otherwise.
+- For neural networks, prefer fine-tuning pre-trained models over training from scratch.
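To make the `scores.csv` / `submission.csv` contract described above concrete, here is a minimal sketch (not part of this PR) for a generic tabular competition. The file paths, the `id`/`target` column names, the `random_forest` model name, and the `accuracy` metric are all placeholder assumptions; the two points it illustrates are that `scores.csv` carries an "ensemble" row even with a single model, and that `submission.csv` mirrors `sample_submission.csv`'s columns without using it as a test index.

```python
# Illustrative sketch only. Paths, column names, model name, and metric are
# placeholder assumptions, not taken from the RD-Agent codebase or this PR.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_sub = pd.read_csv("sample_submission.csv")  # format reference only, never a test index

X, y = train.drop(columns=["id", "target"]), train["target"]

model = RandomForestClassifier(n_estimators=300, random_state=0)
cv_mean = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# scores.csv: one row per model plus an "ensemble" row, even when only one model exists.
# The metric column name must match the metric named in the scenario description.
scores = pd.DataFrame({"accuracy": [cv_mean, cv_mean]}, index=["random_forest", "ensemble"])
scores.to_csv("scores.csv")

# submission.csv: reuse sample_submission's columns and order exactly.
model.fit(X, y)
submission = sample_sub.copy()
submission["target"] = model.predict(test[X.columns])
submission.to_csv("submission.csv", index=False)
```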