data-measurements-tool / data_measurements

Commit History

All changes minus temp.jsonl from last 3 commits
3d4f393

Ezi Ozoani committed on

adding in "HuggingFaceM4/OBELICS" reference
5cdf956

Ezi committed on

mismatch on variable names
394e509

Ezi committed on

try fix
386b032

Ezi Ozoani committed on

can only select available splits
f4b8e6e

Yacine Jernite committed on

Update data_measurements/streamlit_utils.py
c500e3c

yjernite HF staff committed on

Merge branch 'main' of https://huggingface.co/spaces/huggingface/data-measurements-tool-2 into main
df659f9

meg-huggingface committed on

Handling for no words
cda45dd

meg-huggingface committed on

catch missing text_dset
b7bd3e0

Yacine Jernite committed on

Merge branch 'main' of https://huggingface.co/spaces/huggingface/data-measurements-tool-2 into main
b256a5f

meg-huggingface committed on

merge
c2e8ac0

meg-huggingface committed on

cache and link
06262b2

Yacine Jernite committed on

tokenized df bug
4f4c0c4

meg-huggingface committed on

in deployment mode, check whether the cache_dir exists
0069e8c

Yacine Jernite committed on

Removing option for c4 no clean
55a6a06

meg-huggingface committed on

Try..except catching for errors
14e5c2a

meg-huggingface committed on

Merge branch 'main' of https://huggingface.co/spaces/huggingface/data-measurements-tool-2 into main
fff0313

meg-huggingface committed on

Adding another check to see if live before computing dset peek
a52c513

meg-huggingface committed on

Merge branch 'main' of https://huggingface.co/spaces/huggingface/data-measurements-tool-2 into main
9bb1a4c

Yacine Jernite committed on

Fixing some minor breaks
e0ada71

meg-huggingface committed on

patch to show <10000 point in nPMI for performance
9f53328

Yacine Jernite committed on

fix text in intro
ee64ff9

Yacine Jernite committed on

fix length selectbox
67a181c

Yacine Jernite committed on

Experimenting with direct merge conflict push to Spaces
11d3c97

meg-huggingface committed on

Merge
623d429

meg HF staff committed on

Minor
a0a4b07

meg HF staff committed on

Merging from rollback
ec99b37

meg HF staff committed on

merging dataset statistics file
c24f881

meg HF staff committed on

Merging back dataset statistics
e8ac901

meg-huggingface committed on

Be gone, you merge conflicting file (git rm data_measurements/dataset_statistics.py)
2981bb2

meg-huggingface committed on

Updating from rollback
0b7eeeb

meg-huggingface committed on

Update from rollback
f9936fb

meg-huggingface committed on

Adding dependencies for images
deefca3

meg-huggingface committed on

fix embeddings and load figure from file
71da0fd

Yacine Jernite committed on

Switching slider to selectbox for text lengths
84f1693

Sasha committed on

Changing text lengths plot to a static one, saving to .png
abff13d

Sasha committed on

Changing aggrid back to dataframe, even if we can't make the width dynamic
ca9634c

Sasha committed on

Change to npmi display ordering
5546565

meg-huggingface committed on

Loading per-widget. Various changes to streamlit interactions for efficiency.
d3c28ec

meg-huggingface committed on

One more flag passing needed for setting live deployment
e122a90

meg-huggingface committed on

Adds flag for live deployment so that things will not be all recalculated when live.
7c5239c

meg-huggingface committed on

Standardizing filenaming a bit.
0803ab3

meg-huggingface committed on

More modularizing; npmi and labels
a2ae370

meg-huggingface committed on

Some additional modularizing and caching of the text lengths widget
335424f

meg-huggingface committed on

Modularization and caching of text length widget
85cf91c

meg-huggingface committed on

Removes extraneous debugging print statements
6a9c993

meg-huggingface committed on

Begins modularizing so that each widget can be independently loaded without having a requirement on the ordering of load_or_preparing in app.py. This means that each function corresponding to a widget will check if the variables it depends on have been calculated yet. If not, it will call back to calculate them. Because of the messiness this causes with passing the use_cache variable around, I've now set use_cache as a global variable, set when the DatasetStatisticsCacheClass is initialized, and removed the use_cache arguments appearing in nearly every function.
4b53042

meg-huggingface committed on

Removing need to keep around base dset for the header widget; now just saving what is shown -- the first n lines of the base dataset -- as a json, and loading if it's cached.
66693d5

meg-huggingface committed on

Removing any need for a dataframe in expander_general_stats; instead making sure to cache and load the small amount of details needed for this widget. Note I also moved around a couple functions -- same content, just moved -- so that it was easier for me to navigate through the code. I also pulled out a couple of sub-functions from larger functions, again to make the code easier to work with/understand, as well as helping to further modularize so we can limit what needs to be cached.
e1f2cc3

meg-huggingface committed on

Splitting prepare_dataset into preparing the base dataset, and the tokenized dataset. This will help us to have further control over caching and loading data, eventually removing the storage of base dataset.
6af9ef6

meg-huggingface committed on