Upload folder using huggingface_hub
- README.md +61 -36
- blocks.py +3 -0
- dataset.py +4 -7
- fusion.py +2 -2
- inference.py +5 -5
- llm_as_judge.py +5 -1
- loaders.py +74 -12
- metric.py +2 -6
- metric_utils.py +7 -8
- operator.py +5 -5
- operators.py +74 -6
- processors.py +8 -0
- settings_utils.py +1 -0
- split_utils.py +4 -2
- splitters.py +16 -6
- stream.py +98 -7
- stream_operators.py +126 -0
- struct_data_operators.py +39 -0
- text_utils.py +25 -6
README.md
CHANGED
@@ -1,50 +1,75 @@
-- none
-tags:
-- evaluate
-- metric
-description: "TODO: add a description here"
-sdk: gradio
-sdk_version: 3.19.1
-app_file: app.py
-pinned: false
----
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-*Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
-*Give general statement of how to use the metric*
-*List all input arguments in the format below*
-- **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-*Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
-*Note any known limitations or biases that the metric has, with links and references if possible.*
-## Further References
-*Add any useful further references.*
+<div align="center">
+  <img src="./assets/banner.png" alt="Image Description" width="100%" />
+</div>
+
+[![Button](https://img.shields.io/badge/Video-pink?style=for-the-badge)](https://unitxt.readthedocs.io/)
+[![Button](https://img.shields.io/badge/Demo-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/demo.html)
+[![Button](https://img.shields.io/badge/Tutorial-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html)
+[![Button](https://img.shields.io/badge/Paper-pink?style=for-the-badge)](https://arxiv.org/abs/2401.14019)
+[![Button](https://img.shields.io/badge/Documentation-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/modules.html)
+[![Button](https://img.shields.io/badge/Catalog-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/catalog/catalog.__dir__.html)
+[![Button](https://img.shields.io/badge/Contributors-pink?style=for-the-badge)](https://github.com/IBM/unitxt/blob/main/CONTRIBUTING.md)
+[![Button](https://img.shields.io/badge/PyPi-pink?style=for-the-badge)](https://pypi.org/project/unitxt/)
+
+In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.
+
+Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
+
+#
+[![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
+![license](https://img.shields.io/github/license/ibm/unitxt)
+![python](https://img.shields.io/badge/python-3.8%20|%203.9-blue)
+![tests](https://img.shields.io/github/actions/workflow/status/ibm/unitxt/library_tests.yml?branch=main&label=tests)
+[![codecov](https://codecov.io/gh/IBM/unitxt/branch/main/graph/badge.svg?token=mlrWq9cwz3)](https://codecov.io/gh/IBM/unitxt)
+![Read the Docs](https://img.shields.io/readthedocs/unitxt)
+[![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
+
+#
+
+https://github.com/IBM/unitxt/assets/23455264/baef9131-39d4-4164-90b2-05da52919fdf
+
+### 🦄 Currently on Unitxt Catalog
+
+![NLP Tasks](https://img.shields.io/badge/NLP_tasks-40-blue)
+![Dataset Cards](https://img.shields.io/badge/Dataset_Cards-457-blue)
+![Templates](https://img.shields.io/badge/Templates-229-blue)
+![Formats](https://img.shields.io/badge/Formats-18-blue)
+![Metrics](https://img.shields.io/badge/Metrics-98-blue)
+
+### 🦄 Run Unitxt Exploration Dashboard
+
+To launch unitxt graphical user interface first install unitxt with ui requirements:
+```
+pip install unitxt[ui]
+```
+Then launch the ui by running:
+```
+unitxt-explore
+```
+
+# 🦄 Contributors
+
+Please install Unitxt from source by:
+```
+git clone git@github.com:IBM/unitxt.git
+cd unitxt
+pip install -e ".[dev]"
+pre-commit install
+```
+
+# 🦄 Citation
+
+If you use Unitxt in your research, please cite our paper:
+
+```
+@misc{unitxt,
+    title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
+    author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
+    year={2024},
+    eprint={2401.14019},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
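As a usage sketch of the library the new README describes (assuming the `unitxt/data` HuggingFace dataset entry point and a card/template pair that exists in the Unitxt catalog; neither appears in this diff), a recipe can be loaded directly through the `datasets` API:

```python
# Minimal sketch: load a Unitxt recipe via the HuggingFace datasets API.
# The card and template names below are illustrative assumptions.
from datasets import load_dataset

dataset = load_dataset(
    "unitxt/data",
    "card=cards.wnli,template=templates.classification.multi_class.relation.default",
    trust_remote_code=True,
)
print(next(iter(dataset["train"])))  # first verbalized instance of the train split
```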
blocks.py
CHANGED
@@ -22,8 +22,11 @@ from .splitters import RandomSampler, SliceSplit, SplitRandomMix, SpreadSplit
 from .stream import MultiStream
 from .struct_data_operators import (
     ListToKeyValPairs,
+    MapHTMLTableToJSON,
     SerializeKeyValPairs,
+    SerializeTableAsDFLoader,
     SerializeTableAsIndexedRowMajor,
+    SerializeTableAsJson,
     SerializeTableAsMarkdown,
     SerializeTableRowAsList,
     SerializeTableRowAsText,
dataset.py
CHANGED
@@ -10,7 +10,6 @@ from .catalog import __file__ as _
 from .collections import __file__ as _
 from .collections_operators import __file__ as _
 from .dataclass import __file__ as _
-from .dataset_utils import __file__ as _
 from .dataset_utils import get_dataset_artifact
 from .deprecation_utils import __file__ as _
 from .dialog_operators import __file__ as _

@@ -20,13 +19,11 @@ from .file_utils import __file__ as _
 from .formats import __file__ as _
 from .fusion import __file__ as _
 from .generator_utils import __file__ as _
-from .hf_utils import __file__ as _
 from .hf_utils import verify_versions_compatibility
 from .inference import __file__ as _
 from .instructions import __file__ as _
 from .llm_as_judge import __file__ as _
 from .loaders import __file__ as _
-from .logging_utils import __file__ as _
 from .logging_utils import get_logger
 from .metric import __file__ as _
 from .metric_utils import __file__ as _

@@ -40,13 +37,13 @@ from .random_utils import __file__ as _
 from .recipe import __file__ as _
 from .register import __file__ as _
 from .schema import __file__ as _
-from .settings_utils import __file__ as _
 from .settings_utils import get_constants
 from .span_lableing_operators import __file__ as _
 from .split_utils import __file__ as _
 from .splitters import __file__ as _
 from .standard import __file__ as _
 from .stream import __file__ as _
+from .stream_operators import __file__ as _
 from .string_operators import __file__ as _
 from .struct_data_operators import __file__ as _
 from .system_prompts import __file__ as _

@@ -54,7 +51,6 @@ from .task import __file__ as _
 from .templates import __file__ as _
 from .text_utils import __file__ as _
 from .type_utils import __file__ as _
-from .utils import __file__ as _
 from .utils import is_package_installed
 from .validate import __file__ as _
 from .version import __file__ as _

@@ -75,8 +71,9 @@ class Dataset(datasets.GeneratorBasedBuilder):
         if is_package_installed("unitxt"):
             verify_versions_compatibility("dataset", self.VERSION)

-            from unitxt.dataset_utils import
-                get_dataset_artifact as get_dataset_artifact_installed
+            from unitxt.dataset_utils import (
+                get_dataset_artifact as get_dataset_artifact_installed,
+            )

             logger.info("Loading with installed unitxt library...")
             dataset = get_dataset_artifact_installed(self.config.name)
fusion.py
CHANGED
@@ -4,7 +4,7 @@ from typing import Dict, Generator, List, Optional, Union
 from .dataclass import NonPositionalField
 from .operator import SourceOperator
 from .random_utils import new_random_generator
-from .stream import
+from .stream import DynamicStream, MultiStream
 from .type_utils import isoftype

@@ -49,7 +49,7 @@ class BaseFusion(SourceOperator):
     ) -> MultiStream:
         result = {}
         for split in self.splits():
-            result[split] =
+            result[split] = DynamicStream(
                 self.fusion_generator, gen_kwargs={"split": split}
             )
         return MultiStream(result)
inference.py
CHANGED
@@ -121,8 +121,7 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
                 f"Error while trying to run IbmGenAiInferenceEngine."
                 f" Please set the environment param '{api_key_env_var_name}'."
             )
-
-        credentials = Credentials(api_key=api_key, api_endpoint=api_endpoint)
+        credentials = Credentials(api_key=api_key)
         self.client = Client(credentials=credentials)

     def _infer(self, dataset):

@@ -141,13 +140,14 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
             decoding_method=self.parameters.decoding_method,
         )

-        return
-
+        return [
+            response.results[0].generated_text
+            for response in self.client.text.generation.create(
                 model_id=self.model_name,
                 inputs=[instance["source"] for instance in dataset],
                 parameters=genai_params,
             )
-
+        ]


 class OpenAiInferenceEngineParams(Artifact):
llm_as_judge.py
CHANGED
@@ -135,4 +135,8 @@ class LLMAsJudge(BulkInstanceMetric):
         dataset = produce(instances, recipe)
         verdicts = self.inference_model.infer(dataset)
         meta_scores = evaluate(predictions=verdicts, data=dataset)
-        return [
+        return [
+            {self.main_score: instance["prediction"], "judge_raw_output": verdict}
+            for instance in meta_scores
+            for verdict in verdicts
+        ]
loaders.py
CHANGED
@@ -30,6 +30,7 @@ Available Loaders Overview:

 ------------------------
 """
+import fnmatch
 import itertools
 import os
 import tempfile

@@ -41,6 +42,7 @@ from typing import Any, Dict, List, Mapping, Optional, Sequence, Union

 import pandas as pd
 from datasets import load_dataset as hf_load_dataset
+from huggingface_hub import HfApi
 from tqdm import tqdm

 from .dataclass import InternalField, OptionalField

@@ -49,7 +51,7 @@ from .logging_utils import get_logger
 from .operator import SourceOperator
 from .operators import AddFields
 from .settings_utils import get_settings
-from .stream import
+from .stream import DynamicStream, MultiStream

 logger = get_logger()
 settings = get_settings()

@@ -259,7 +261,7 @@ class LoadHF(Loader):
             self.log_limited_loading()
             return MultiStream(
                 {
-                    name:
+                    name: DynamicStream(
                         generator=self.split_limited_load, gen_kwargs={"split_name": name}
                     )
                     for name in self._cache.keys()

@@ -349,7 +351,7 @@ class LoadCSV(Loader):
         if self.streaming:
             return MultiStream(
                 {
-                    name:
+                    name: DynamicStream(
                         generator=self.stream_csv, gen_kwargs={"file": file}
                     )
                     for name, file in self.files.items()

@@ -358,9 +360,7 @@ class LoadCSV(Loader):

         return MultiStream(
             {
-                name:
-                    generator=self.load_csv, gen_kwargs={"file": file}
-                )
+                name: DynamicStream(generator=self.load_csv, gen_kwargs={"file": file})
                 for name, file in self.files.items()
             }
         )

@@ -385,7 +385,6 @@ class LoadFromSklearn(Loader):

     dataset_name: str
     splits: List[str] = ["train", "test"]
-    data_classification_policy = ["public"]

     _requirements_list: List[str] = ["sklearn", "pandas"]

@@ -683,8 +682,10 @@ class LoadFromDictionary(Loader):
     .. code-block:: python

         data = {
-            "train": {"input": "SomeInput1", "output": "SomeResult1"},
-
+            "train": [{"input": "SomeInput1", "output": "SomeResult1"},
+                      {"input": "SomeInput2", "output": "SomeResult2"}],
+            "test": [{"input": "SomeInput3", "output": "SomeResult3"},
+                     {"input": "SomeInput4", "output": "SomeResult4"}]
         }
         loader = LoadFromDictionary(data=data)
     """

@@ -794,18 +795,79 @@ class LoadFromHFSpace(LoadHF):
         else:
             data_files = self.data_files

+        dir_paths_list = []
         for files in data_files:
             if isinstance(files, str):
                 files = [files]
-
+
             paths = [self._download_file_from_space(file) for file in files]
-
-
+            dir_paths = [
+                path.replace(file_url, "") for path, file_url in zip(paths, files)
+            ]
+            dir_paths_list.extend(dir_paths)
+
+        # All files - within the same space - are downloaded into the same base directory:
+        assert len(set(dir_paths_list)) == 1
+
+        return f"{dir_paths_list.pop()}"
+
+    @staticmethod
+    def _is_wildcard(path: str) -> bool:
+        wildcard_characters = ["*", "?", "[", "]"]
+        return any(char in path for char in wildcard_characters)
+
+    def _get_file_list_from_wildcard_path(
+        self, pattern: str, repo_files: List
+    ) -> List[str]:
+        if self._is_wildcard(pattern):
+            return fnmatch.filter(repo_files, pattern)
+        return [pattern]
+
+    def _map_wildcard_path_to_full_paths(self):
+        api = HfApi()
+        repo_files = api.list_repo_files(self.space_name, repo_type="space")
+        if isinstance(self.data_files, str):
+            self.data_files = self._get_file_list_from_wildcard_path(
+                self.data_files, repo_files
+            )
+        elif isinstance(self.data_files, Mapping):
+            new_mapping = {}
+            for k, v in self.data_files.items():
+                if isinstance(v, list):
+                    assert all(isinstance(s, str) for s in v)
+                    new_mapping[k] = [
+                        file
+                        for p in v
+                        for file in self._get_file_list_from_wildcard_path(
+                            p, repo_files
+                        )
+                    ]
+                elif isinstance(v, str):
+                    new_mapping[k] = self._get_file_list_from_wildcard_path(
+                        v, repo_files
+                    )
+                else:
+                    raise NotImplementedError(
+                        f"Loader does not support input 'data_files' of type Mapping[{type(v)}]"
+                    )
+
+            self.data_files = new_mapping
+        elif isinstance(self.data_files, list):
+            assert all(isinstance(s, str) for s in self.data_files)
+            self.data_files = [
+                file
+                for p in self.data_files
+                for file in self._get_file_list_from_wildcard_path(p, repo_files)
+            ]
+        else:
+            raise NotImplementedError(
+                f"Loader does not support input 'data_files' of type {type(self.data_files)}"
+            )

     def load_data(self):
         self.sef_default_data_classification(
             ["public"], "when loading from Huggingface spaces"
         )
+        self._map_wildcard_path_to_full_paths()
         self.path = self._download_data()
         return super().load_data()
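The new `_map_wildcard_path_to_full_paths` above expands glob-style patterns against the space's file listing with the standard-library `fnmatch` module. A standalone sketch of that expansion (the file names are illustrative, not from the repository):

```python
# fnmatch.filter keeps only the repo files matching a glob-style pattern.
import fnmatch

repo_files = ["results/run1.json", "results/run2.json", "README.md"]
print(fnmatch.filter(repo_files, "results/*.json"))
# ['results/run1.json', 'results/run2.json']
```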
metric.py
CHANGED
@@ -19,16 +19,13 @@ from .file_utils import __file__ as _
 from .formats import __file__ as _
 from .fusion import __file__ as _
 from .generator_utils import __file__ as _
-from .hf_utils import __file__ as _
 from .hf_utils import verify_versions_compatibility
 from .inference import __file__ as _
 from .instructions import __file__ as _
 from .llm_as_judge import __file__ as _
 from .loaders import __file__ as _
 from .logging_utils import __file__ as _
-from .metric_utils import UNITXT_METRIC_SCHEMA
-from .metric_utils import __file__ as _
-from .metric_utils import _compute
+from .metric_utils import UNITXT_METRIC_SCHEMA, _compute
 from .metrics import __file__ as _
 from .normalizers import __file__ as _
 from .operator import __file__ as _

@@ -39,13 +36,13 @@ from .random_utils import __file__ as _
 from .recipe import __file__ as _
 from .register import __file__ as _
 from .schema import __file__ as _
-from .settings_utils import __file__ as _
 from .settings_utils import get_constants
 from .span_lableing_operators import __file__ as _
 from .split_utils import __file__ as _
 from .splitters import __file__ as _
 from .standard import __file__ as _
 from .stream import __file__ as _
+from .stream_operators import __file__ as _
 from .string_operators import __file__ as _
 from .struct_data_operators import __file__ as _
 from .system_prompts import __file__ as _

@@ -53,7 +50,6 @@ from .task import __file__ as _
 from .templates import __file__ as _
 from .text_utils import __file__ as _
 from .type_utils import __file__ as _
-from .utils import __file__ as _
 from .utils import is_package_installed
 from .validate import __file__ as _
 from .version import __file__ as _
metric_utils.py
CHANGED
@@ -15,7 +15,7 @@ from .operator import (
 from .operators import (
     ApplyMetric,
     ApplyOperatorsField,
-
+    Copy,
     FlattenInstances,
     MergeStreams,
     SplitByNestedGroup,

@@ -23,7 +23,7 @@ from .operators import (
 from .register import _reset_env_local_catalogs, register_all_artifacts
 from .schema import UNITXT_DATASET_SCHEMA
 from .settings_utils import get_settings
-from .stream import
+from .stream import DynamicStream, MultiStream
 from .struct_data_operators import LoadJson

@@ -109,7 +109,7 @@ class MultiStreamScoreMean(MultiStreamOperator):
         return MultiStream(
             {
-                stream_name:
+                stream_name: DynamicStream(
                     never_peek_twice_generator,
                     gen_kwargs={
                         "stream_name": stream_name,

@@ -132,7 +132,7 @@ class FromPredictionsAndOriginalData(StreamInitializerOperator):
     ) -> MultiStream:
         return MultiStream(
             {
-                split_name:
+                split_name: DynamicStream(
                     self.zip,
                     gen_kwargs={"predictions": predictions, "references": references},
                 )

@@ -155,10 +155,9 @@ class MetricRecipe(SequentialOperatorInitializer):
         self.steps = [
             FromPredictionsAndOriginalData(),
             LoadJson(field="task_data"),
-
-
-
-                }
+            Copy(
+                field="source",
+                to_field="task_data/source",
             ),
             ApplyOperatorsField(
                 operators_field="postprocessors",
operator.py
CHANGED
@@ -4,7 +4,7 @@ from typing import Any, Dict, Generator, List, Optional, Union

 from .artifact import Artifact
 from .dataclass import InternalField, NonPositionalField
-from .stream import
+from .stream import DynamicStream, EmptyStreamError, MultiStream, Stream
 from .utils import is_module_available

@@ -170,7 +170,7 @@ def instance_generator(instance):


 def stream_single(instance: Dict[str, Any]) -> Stream:
-    return
+    return DynamicStream(
         generator=instance_generator, gen_kwargs={"instance": instance}
     )

@@ -244,7 +244,7 @@ class StreamOperator(MultiStreamOperator):
     def _process_single_stream(
         self, stream: Stream, stream_name: Optional[str] = None
     ) -> Stream:
-        return
+        return DynamicStream(
             self._process_stream,
             gen_kwargs={"stream": stream, "stream_name": stream_name},
         )

@@ -401,7 +401,7 @@ class InstanceOperatorValidator(InstanceOperator):
         try:
             first_instance = next(iterator)
         except StopIteration as e:
-            raise
+            raise EmptyStreamError(f"Stream '{stream_name}' is empty") from e
         result = self._process_instance(first_instance, stream_name)
         self.validate(result)
         yield result

@@ -439,7 +439,7 @@ class InstanceOperatorWithMultiStreamAccess(StreamingOperator):
         result = {}

         for stream_name, stream in multi_stream.items():
-            stream =
+            stream = DynamicStream(
                 self.generator,
                 gen_kwargs={"stream": stream, "multi_stream": multi_stream},
             )
operators.py
CHANGED
@@ -76,7 +76,7 @@ from .operator import (
 )
 from .random_utils import new_random_generator
 from .settings_utils import get_settings
-from .stream import
+from .stream import DynamicStream, Stream
 from .text_utils import nested_tuple_to_string
 from .type_utils import isoftype
 from .utils import flatten_dict

@@ -282,6 +282,24 @@ class RemoveFields(InstanceOperator):
         return instance


+class SelectFields(InstanceOperator):
+    """Keep only specified fields from each instance in a stream.
+
+    Args:
+        fields (List[str]): The fields to keep from each instance.
+    """
+
+    fields: List[str]
+
+    def process(
+        self, instance: Dict[str, Any], stream_name: Optional[str] = None
+    ) -> Dict[str, Any]:
+        new_instance = {}
+        for selected_field in self.fields:
+            new_instance[selected_field] = instance[selected_field]
+        return new_instance
+
+
 class InstanceFieldOperator(InstanceOperator):
     """A general stream instance operator that processes the values of a field (or multiple ones).

@@ -1007,7 +1025,7 @@ class Perturb(FieldOperator):
         return value


-class
+class Copy(FieldOperator):
     """Copies values from specified fields to specified fields.

     Args (of parent class):

@@ -1015,13 +1033,13 @@ class CopyFields(FieldOperator):

     Examples:
         An input instance {"a": 2, "b": 3}, when processed by
-
+        Copy(field_to_field={"a": "b"}
         would yield {"a": 2, "b": 2}, and when processed by
-
+        Copy(field_to_field={"a": "c"} would yield
         {"a": 2, "b": 3, "c": 2}

         with field names containing / , we can also copy inside the field:
-
+        Copy(field="a/0",to_field="a")
         would process instance {"a": [1, 3]} into {"a": 1}

@@ -1031,6 +1049,10 @@ class CopyFields(FieldOperator):
         return copy.deepcopy(value)


+class CopyFields(Copy):
+    pass
+
+
 class GetItemByIndex(FieldOperator):
     """Get from the item list by the index in the field."""

@@ -1299,6 +1321,52 @@ class FilterByCondition(StreamOperator):
         return True


+class FilterByConditionBasedOnFields(FilterByCondition):
+    """Filters a stream based on a condition between 2 fields values.
+
+    Raises an error if either of the required fields names is missing from the input instance.
+
+    Args:
+        values (Dict[str, str]): The fields names that the filter operation is based on.
+        condition: the name of the desired condition operator between the specified field's values. Supported conditions are ("gt", "ge", "lt", "le", "ne", "eq", "in", "not in")
+        error_on_filtered_all (bool, optional): If True, raises an error if all instances are filtered out. Defaults to True.
+
+    Examples:
+        FilterByCondition(values = {"a":"b}, condition = "gt") will yield only instances where field "a" contains a value greater then the value in field "b".
+        FilterByCondition(values = {"a":"b}, condition = "le") will yield only instances where "a"<="b"
+    """
+
+    def _is_required(self, instance: dict) -> bool:
+        for key, value in self.values.items():
+            try:
+                instance_key = dict_get(instance, key)
+            except ValueError as ve:
+                raise ValueError(
+                    f"Required filter field ('{key}') in FilterByCondition is not found in {instance}"
+                ) from ve
+            try:
+                instance_value = dict_get(instance, value)
+            except ValueError as ve:
+                raise ValueError(
+                    f"Required filter field ('{value}') in FilterByCondition is not found in {instance}"
+                ) from ve
+            if self.condition == "in":
+                if instance_key not in instance_value:
+                    return False
+            elif self.condition == "not in":
+                if instance_key in instance_value:
+                    return False
+            else:
+                func = self.condition_to_func[self.condition]
+                if func is None:
+                    raise ValueError(
+                        f"Function not defined for condition '{self.condition}'"
+                    )
+                if not func(instance_key, instance_value):
+                    return False
+        return True
+
+
 class ComputeExpressionMixin(Artifact):
     """Computes an expression expressed over fields of an instance.

@@ -1774,7 +1842,7 @@ class MergeStreams(MultiStreamOperator):
     def process(self, multi_stream: MultiStream) -> MultiStream:
         return MultiStream(
             {
-                self.new_stream_name:
+                self.new_stream_name: DynamicStream(
                     self.merge, gen_kwargs={"multi_stream": multi_stream}
                 )
             }
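A minimal usage sketch for the `SelectFields` operator added above, assuming the module is importable as `unitxt.operators`; its `process()` method appears in full in this diff:

```python
# Keep only the listed fields of an instance; everything else is dropped.
from unitxt.operators import SelectFields

instance = {"question": "2+2?", "answer": "4", "internal_id": 17}
op = SelectFields(fields=["question", "answer"])
print(op.process(instance))
# {'question': '2+2?', 'answer': '4'}
```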
processors.py
CHANGED
@@ -245,3 +245,11 @@ class LiteralEval(FieldOperator):
         if text is None or text == "":
             return text
         return ast.literal_eval(text.strip())
+
+
+class ExtractSafeUnsafeJudgment(FieldOperator):
+    def process_value(self, text: Any) -> Any:
+        first_line = str(text).strip().split("\n")[0].lower()
+        if first_line == "safe":
+            return 1.0
+        return 0.0
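A behavior sketch of the `ExtractSafeUnsafeJudgment` logic added above, restated as a plain function: only an output whose first line is exactly "safe" (case-insensitive) maps to 1.0.

```python
# Same decision rule as the operator's process_value, outside the FieldOperator class.
def extract_judgment(text):
    first_line = str(text).strip().split("\n")[0].lower()
    return 1.0 if first_line == "safe" else 0.0

print([extract_judgment(t) for t in ["Safe\nNo issues found.", "unsafe", "mostly safe"]])
# [1.0, 0.0, 0.0]
```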
settings_utils.py
CHANGED
@@ -127,6 +127,7 @@ if Settings.is_uninitilized():
     settings.artifactories = None
     settings.default_recipe = "standard_recipe"
     settings.default_verbosity = "info"
+    settings.use_eager_execution = False
     settings.remote_metrics = []
     settings.test_card_disable = (bool, False)
     settings.test_metric_disable = (bool, False)
split_utils.py
CHANGED
@@ -5,7 +5,7 @@ from typing import Dict
 from .generator_utils import ReusableGenerator
 from .logging_utils import get_logger
 from .random_utils import new_random_generator
-from .stream import Stream
+from .stream import MissingStreamError, Stream

 logger = get_logger()

@@ -140,7 +140,9 @@ def slice_streams(input_streams, mapping):
     def generator(new_stream, sources):
         for old_stream, slices in sources.items():
             if old_stream not in input_streams:
-                raise
+                raise MissingStreamError(
+                    f"'{old_stream}' is not available in input streams, but need to slice there from"
+                )
             old_stream_content = input_streams[old_stream]
             for start, end in slices:
                 yield from slice_stream(old_stream_content, start, end)
splitters.py
CHANGED
@@ -1,5 +1,6 @@
 import itertools
 from abc import abstractmethod
+from copy import deepcopy
 from random import Random
 from typing import Dict, List

@@ -13,7 +14,7 @@ from .split_utils import (
     rename_split,
     slice_streams,
 )
-from .stream import MultiStream
+from .stream import EmptyStreamError, FaultyStreamError, MultiStream


 class Splitter(MultiStreamOperator):

@@ -138,8 +139,13 @@ class Sampler(Artifact):
     ) -> List[Dict[str, object]]:
         if "inputs" not in instance:
             raise ValueError(f"'inputs' field is missing from '{instance}'.")
-
-
+        # l = list(filter(lambda x: x["inputs"] != instance["inputs"], instances_pool))
+        try:
+            return [
+                item for item in instances_pool if item["inputs"] != instance["inputs"]
+            ]
+        except Exception as e:
+            raise e


 class RandomSampler(Sampler):

@@ -282,16 +288,20 @@ class SpreadSplit(InstanceOperatorWithMultiStreamAccess):
     ) -> Dict[str, object]:
         try:
             if self.local_cache is None:
-                self.local_cache = list(multi_stream[self.source_stream])
+                self.local_cache = deepcopy(list(multi_stream[self.source_stream]))

             source_stream = self.local_cache
             source_stream = self.sampler.filter_source_by_instance(
                 source_stream, instance
             )
+            if len(source_stream) < self.sampler.sample_size:
+                raise ValueError(
+                    f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
+                )
             sampled_instances = self.sampler.sample(source_stream)
             instance[self.target_field] = sampled_instances
             return instance
-        except
-            raise
+        except FaultyStreamError as e:
+            raise EmptyStreamError(
                 f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}', due to {e.__class__.__name__}: {e}"
             ) from e
stream.py
CHANGED
@@ -1,11 +1,19 @@
 import tempfile
+import traceback
+import warnings
 from abc import abstractmethod
-from
+from copy import deepcopy
+from typing import Any, Callable, Dict, Generator, Iterable, List

 from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict

 from .dataclass import Dataclass, OptionalField
 from .generator_utils import CopyingReusableGenerator, ReusableGenerator
+from .logging_utils import get_logger
+from .settings_utils import get_settings
+
+settings = get_settings()
+logger = get_logger()


 class Stream(Dataclass):

@@ -21,22 +29,32 @@ class Stream(Dataclass):
     def take(self, n):
         pass

+    @abstractmethod
+    def set_copying(self, copying: bool):
+        pass
+

 class ListStream(Stream):
     instances_list: List[Dict[str, Any]]
+    copying: bool = False

     def __iter__(self):
+        if self.copying:
+            return iter(deepcopy(self.instances_list))
         return iter(self.instances_list)

     def peek(self):
-        return next(iter(self
+        return next(iter(self))

-    def take(self, n):
+    def take(self, n) -> Generator:
         for i, instance in enumerate(self.instances_list):
             if i >= n:
                 break
             yield instance

+    def set_copying(self, copying: bool):
+        self.copying = copying
+

 class GeneratorStream(Stream):
     """A class for handling streaming data in a customizable way.

@@ -88,6 +106,79 @@ class GeneratorStream(Stream):
                 break
             yield instance

+    def set_copying(self, copying: bool):
+        self.copying = copying
+
+
+class FaultyStreamError(Exception):
+    """Base class for all stream-related exceptions."""
+
+    pass
+
+
+class MissingStreamError(FaultyStreamError):
+    """Raised when a required stream is missing."""
+
+    pass
+
+
+class EmptyStreamError(FaultyStreamError):
+    """Raised when a stream is unexpectedly empty."""
+
+    pass
+
+
+def eager_failed():
+    traceback.print_exc()
+    warnings.warn(
+        "The eager execution has failed due to the error above.", stacklevel=2
+    )
+
+
+class DynamicStream(Stream):
+    generator: Callable
+    gen_kwargs: Dict[str, Any] = OptionalField(default_factory=dict)
+    caching: bool = False
+    copying: bool = False
+
+    def __post_init__(self):
+        self.stream = None
+        if settings.use_eager_execution:
+            try:
+                instances_list = []
+                for instance in self.generator(**self.gen_kwargs):
+                    instances_list.append(instance)
+                self.stream = ListStream(
+                    instances_list=instances_list, copying=self.copying
+                )
+            except FaultyStreamError:
+                eager_failed()
+            except RuntimeError as e:
+                if isinstance(e.__cause__, FaultyStreamError):
+                    eager_failed()
+                else:
+                    raise e
+
+        if self.stream is None:
+            self.stream = GeneratorStream(
+                generator=self.generator,
+                gen_kwargs=self.gen_kwargs,
+                caching=self.caching,
+                copying=self.copying,
+            )
+
+    def __iter__(self):
+        return self.stream.__iter__()
+
+    def peek(self):
+        return self.stream.peek()
+
+    def take(self, n):
+        return self.stream.take(n)
+
+    def set_copying(self, copying: bool):
+        self.stream.set_copying(copying)
+

 class MultiStream(dict):
     """A class for handling multiple streams of data in a dictionary-like format.

@@ -112,7 +203,7 @@ class MultiStream(dict):
             isinstance(key, str), "MultiStream keys must be strings"
         super().__init__(data)

-    def get_generator(self, key):
+    def get_generator(self, key) -> Generator:
         """Gets a generator for a specified key.

         Args:

@@ -129,7 +220,7 @@ class MultiStream(dict):

     def set_copying(self, copying: bool):
         for stream in self.values():
-            stream.copying
+            stream.set_copying(copying)

     def to_dataset(self, disable_cache=True, cache_dir=None) -> DatasetDict:
         with tempfile.TemporaryDirectory() as dir_to_be_deleted:

@@ -178,7 +269,7 @@ class MultiStream(dict):
         assert all(isinstance(v, ReusableGenerator) for v in generators.values())
         return cls(
             {
-                key:
+                key: DynamicStream(
                     generator.generator,
                     gen_kwargs=generator.gen_kwargs,
                     caching=caching,

@@ -204,7 +295,7 @@ class MultiStream(dict):
         """
         return cls(
             {
-                key:
+                key: DynamicStream(
                     iterable.__iter__,
                     caching=caching,
                     copying=copying,
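A sketch of how the new `DynamicStream` interacts with the `use_eager_execution` flag added in settings_utils.py, assuming the interfaces shown in this diff and a settings object that allows attribute assignment:

```python
# With eager execution on, DynamicStream materializes the generator into a ListStream
# at construction time; otherwise it wraps it lazily in a GeneratorStream.
from unitxt.settings_utils import get_settings
from unitxt.stream import DynamicStream

def count_up(n):
    for i in range(n):
        yield {"i": i}

get_settings().use_eager_execution = True
stream = DynamicStream(generator=count_up, gen_kwargs={"n": 3})
print(stream.peek())         # {'i': 0}
print(list(stream.take(2)))  # [{'i': 0}, {'i': 1}]
```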
stream_operators.py
ADDED
@@ -0,0 +1,126 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""This section describes unitxt operators.
|
2 |
+
|
3 |
+
Operators: Building Blocks of Unitxt Processing Pipelines
|
4 |
+
==============================================================
|
5 |
+
|
6 |
+
Within the Unitxt framework, operators serve as the foundational elements used to assemble processing pipelines.
|
7 |
+
Each operator is designed to perform specific manipulations on dictionary structures within a stream.
|
8 |
+
These operators are callable entities that receive a MultiStream as input.
|
9 |
+
The output is a MultiStream, augmented with the operator's manipulations, which are then systematically applied to each instance in the stream when pulled.
|
10 |
+
|
11 |
+
Creating Custom Operators
|
12 |
+
-------------------------------
|
13 |
+
To enhance the functionality of Unitxt, users are encouraged to develop custom operators.
|
14 |
+
This can be achieved by inheriting from any of the existing operators listed below or from one of the fundamental :class:`base operators<unitxt.operator>`.
|
15 |
+
The primary task in any operator development is to implement the `process` function, which defines the unique manipulations the operator will perform.
|
16 |
+
|
17 |
+
General or Specelized Operators
|
18 |
+
--------------------------------
|
19 |
+
Some operators are specielized in specific task such as:
|
20 |
+
|
21 |
+
- :class:`loaders<unitxt.loaders>` for loading data.
|
22 |
+
- :class:`splitters<unitxt.splitters>` for fixing data splits.
|
23 |
+
- :class:`struct_data_operators<unitxt.struct_data_operators>` for structured data operators.
|
24 |
+
|
25 |
+
Other specelized operators are used by unitxt internally:
|
26 |
+
|
27 |
+
- :class:`templates<unitxt.templates>` for verbalizing data examples.
|
28 |
+
- :class:`formats<unitxt.formats>` for preparing data for models.
|
29 |
+
|
30 |
+
The rest of this section is dedicated for operators that operates on streams.
|
31 |
+
|
32 |
+
"""
|
33 |
+
|
34 |
+
from typing import (
|
35 |
+
List,
|
36 |
+
Literal,
|
37 |
+
Optional,
|
38 |
+
)
|
39 |
+
|
40 |
+
import pandas as pd
|
41 |
+
|
42 |
+
from .operator import (
|
43 |
+
MultiStream,
|
44 |
+
MultiStreamOperator,
|
45 |
+
)
|
46 |
+
from .settings_utils import get_settings
|
47 |
+
from .stream import ListStream
|
48 |
+
|
49 |
+
settings = get_settings()
|
50 |
+
|
51 |
+
|
52 |
+
class JoinStreams(MultiStreamOperator):
|
53 |
+
"""Join multiple streams into a single stream.
|
54 |
+
|
55 |
+
Args:
|
56 |
+
left_stream (str): The stream that will be considered the "left" in the join operations.
|
57 |
+
right_stream (str): The stream that will be considered the "right" in the join operations.
|
58 |
+
how (Literal["left", "right", "inner", "outer", "cross"]): The type of join to be performed.
|
59 |
+
on (Optional[List[str]]): Column names to join on. These must be found in both streams.
|
60 |
+
left_on (Optional[List[str]]): Column names to join on in the left stream.
|
61 |
+
right_on (Optional[List[str]]): Column names to join on in the right streasm.
|
62 |
+
new_stream_name (str): The name of the new stream resulting from the merge.
|
63 |
+
|
64 |
+
Examples:
|
65 |
+
JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on="question_id", new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field using inner join, resulting with a new stream named "question_with_answers".
|
66 |
+
JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on_left="question_id", on_right="question" new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field in the left stream and the 'question' field in the right stream, using inner join, resulting with a new stream named "question_with_answers". This is suitable when the fields have different labels across the streams.
|
67 |
+
"""
|
68 |
+
|
69 |
+
left_stream: str
|
70 |
+
right_stream: str
|
71 |
+
how: Literal["left", "right", "inner", "outer", "cross"]
|
72 |
+
on: Optional[List[str]] = None
|
73 |
+
left_on: Optional[List[str]] = None
|
74 |
+
right_on: Optional[List[str]] = None
|
75 |
+
new_stream_name: str
|
76 |
+
|
77 |
+
    def merge(self, multi_stream) -> List:
        assert self.right_stream in multi_stream and self.left_stream in multi_stream
        stream_dict = dict(multi_stream.items())
        left_stream = list(stream_dict[self.left_stream])
        right_stream = list(stream_dict[self.right_stream])
        left_stream_df = pd.DataFrame(left_stream)
        right_stream_df = pd.DataFrame(right_stream)

        # Remove common columns we don't join on, so we don't get unexpected
        # columns in the result (pandas' standard behavior is to add a suffix).
        common_cols = set(left_stream_df.columns).intersection(
            set(right_stream_df.columns)
        )
        on = self.on if self.on is not None else []
        left_on = self.left_on if self.left_on is not None else []
        right_on = self.right_on if self.right_on is not None else []
        on_cols = set(on + left_on + right_on)
        col_to_remove = list(common_cols - on_cols)
        left_stream_df = left_stream_df.drop(columns=col_to_remove, errors="ignore")
        right_stream_df = right_stream_df.drop(columns=col_to_remove, errors="ignore")

        merged_df = pd.merge(
            left_stream_df,
            right_stream_df,
            how=self.how,
            on=self.on,
            left_on=self.left_on,
            right_on=self.right_on,
        )
        return merged_df.to_dict(orient="records")

    def process(self, multi_stream: MultiStream) -> MultiStream:
        merged_records = self.merge(multi_stream)
        multi_stream[self.new_stream_name] = ListStream(instances_list=merged_records)
        return multi_stream

class DeleteSplits(MultiStreamOperator):
    """Operator that deletes splits from the stream.

    Attributes:
        splits (List[str]): The splits to delete from the stream.
    """

    splits: List[str]

    def process(self, multi_stream: MultiStream) -> MultiStream:
        generators = {
            key: val for key, val in multi_stream.items() if key not in self.splits
        }
        return MultiStream(generators)
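To show how the two operators above compose, here is a hedged usage sketch with invented data. It assumes `MultiStream` behaves like a dict of streams (as the code above relies on) and that the classes are importable as `unitxt.stream_operators.JoinStreams` and `unitxt.stream_operators.DeleteSplits`.

```python
from unitxt.operator import MultiStream
from unitxt.stream import ListStream
from unitxt.stream_operators import DeleteSplits, JoinStreams

# Invented input: two splits sharing a 'question_id' key.
multi_stream = MultiStream(
    {
        "questions": ListStream(
            instances_list=[{"question_id": 1, "question": "What is unitxt?"}]
        ),
        "answers": ListStream(
            instances_list=[{"question_id": 1, "answer": "A data preparation library."}]
        ),
    }
)

# Join the two splits on 'question_id' into a new split.
multi_stream = JoinStreams(
    left_stream="questions",
    right_stream="answers",
    how="inner",
    on=["question_id"],
    new_stream_name="question_with_answers",
).process(multi_stream)

# Drop the original splits, keeping only the joined one.
multi_stream = DeleteSplits(splits=["questions", "answers"]).process(multi_stream)

print(list(multi_stream["question_with_answers"]))
# [{'question_id': 1, 'question': 'What is unitxt?', 'answer': 'A data preparation library.'}]
```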
struct_data_operators.py
CHANGED
@@ -566,3 +566,42 @@ class LoadJson(FieldOperator):
 class DumpJson(FieldOperator):
     def process_value(self, value: str) -> str:
         return json.dumps(value)
+
+
+class MapHTMLTableToJSON(FieldOperator):
+    """Converts an HTML table to the basic JSON table format.
+
+    JSON format:
+    {
+        "header": ["col1", "col2"],
+        "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
+    }
+    """
+
+    _requirements_list = ["bs4"]
+
+    def process_value(self, table: Any) -> Any:
+        return self.truncate_table_rows(table_content=table)
+
+    def truncate_table_rows(self, table_content: str) -> Dict:
+        from bs4 import BeautifulSoup
+
+        soup = BeautifulSoup(table_content, "html.parser")
+
+        # Extract header
+        header = []
+        header_cells = soup.find("thead").find_all("th")
+        for cell in header_cells:
+            header.append(cell.get_text())
+
+        # Extract rows
+        rows = []
+        for row in soup.find("tbody").find_all("tr"):
+            row_data = []
+            for cell in row.find_all("td"):
+                row_data.append(cell.get_text())
+            rows.append(row_data)
+
+        # return dictionary
+
+        return {"header": header, "rows": rows}
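For orientation, a small hedged example of what the new operator returns, assuming `bs4` is installed and that `MapHTMLTableToJSON` can be instantiated with no arguments like other `FieldOperator` subclasses:

```python
from unitxt.struct_data_operators import MapHTMLTableToJSON

html_table = (
    "<table>"
    "<thead><tr><th>col1</th><th>col2</th></tr></thead>"
    "<tbody><tr><td>row11</td><td>row12</td></tr></tbody>"
    "</table>"
)

# process_value is the per-field hook shown in the diff above.
print(MapHTMLTableToJSON().process_value(html_table))
# {'header': ['col1', 'col2'], 'rows': [['row11', 'row12']]}
```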
text_utils.py
CHANGED
@@ -135,8 +135,8 @@ def is_made_of_sub_strings(string, sub_strings):
     return bool(re.match(pattern, string))
 
 
-# Given all the lines of a file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
-# and an object name, e.g. TaskCard,
+# Given all the lines of a card preparer file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
+# and an object name, e.g. TaskCard(,
 # return the ordinal number of the line that starts that object, in our example: the
 # line number of the following line (notice that the line where TaskCard is imported
 # is not supposed to return):
@@ -145,10 +145,12 @@ def is_made_of_sub_strings(string, sub_strings):
 # the matching close:
 # )
 # This util depends on ruff to ensure this setting of the card file: that a close of one
-# tag and the open of the next tag, do not sit in same line, both tags being
-# major level within TaskCard
+# tag and the open of the next tag do not sit in the same line, when both tags are at
+# major level within TaskCard.
+# It also prepares for the case that the __description__ tag does not contain balanced
+# parentheses, since it is often cut in the middle (with "... see more at").
 # flake8: noqa: B007
-def lines_defining_obj(
+def lines_defining_obj_in_card(
     all_lines: List[str], obj_name: str, start_search_at_line: int = 0
 ) -> Tuple[int, int]:
     for starting_line in range(start_search_at_line, len(all_lines)):
@@ -160,11 +162,28 @@ def lines_defining_obj(
         return (-1, -1)
     num_of_opens = 0
     num_of_closes = 0
-    for ending_line in range(starting_line, len(all_lines)):
+    ending_line = starting_line - 1
+    while ending_line < len(all_lines):
+        ending_line += 1
         num_of_opens += len(re.findall(r"[({[]", all_lines[ending_line]))
         num_of_closes += len(re.findall(r"[)}\]]", all_lines[ending_line]))
         if num_of_closes == num_of_opens:
             break
+        if "__description__" in all_lines[ending_line]:
+            # can not trust parentheses inside the description.
+            # trust the indentation enforced by ruff, and the way we build __description__:
+            # a line consisting of only __description__=(
+            # followed by one or more lines of text, whose opens and closes can not be
+            # trusted, followed by a line consisting of only: ),
+            # where the ) is indented with the beginning of __description__
+            tag_indentation = all_lines[ending_line].index("__description__")
+            last_line_to_start_with = (" " * tag_indentation) + ")"
+            while not all_lines[ending_line].startswith(last_line_to_start_with):
+                ending_line += 1
+            if "__description__" in obj_name:
+                return (starting_line, ending_line)
+            num_of_closes += 1  # for this last line of desc
+            # continue to the line following the end of description
 
     if num_of_closes != num_of_opens:
         raise ValueError(