
Christopher Schröder

cschroeder

AI & ML interests

NLP, Active Learning, Text Representations, PyTorch

Organizations

cschroeder's activity

replied to do-me's post about 12 hours ago

Did not know text-splitter yet, thanks!

posted an update 11 days ago
⚖️ AI Training Is Copyright Infringement

This bold claim is not my opinion; it is made in a recent "report" by a group whose stance is recognizable in its name, roughly translated as "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.

I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is obviously very biased towards one side.

While I think the report itself does not deserve attention, I am posting it in the hope that you will find more examples where it does not address the issue adequately. Feel free to add them to my LinkedIn post (where the original authors will see them) or here.

[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214

LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx

posted an update 27 days ago
🌟 Liger Kernel: Efficient Triton Kernels for LLM Training

Liger Kernel "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%."

GitHub: https://github.com/linkedin/Liger-Kernel
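The gains come from patching a model's modules in place with fused Triton implementations, rather than requiring a different model. Here is a minimal, framework-free sketch of that patching idea; all names below are hypothetical illustrations, not the library's actual API:

```python
# Toy illustration of the kernel-patching pattern used by
# Liger-style libraries: swap a layer implementation for a fused,
# mathematically equivalent one without changing the interface.

def naive_rms_norm(xs):
    """Reference RMSNorm: divide each value by the root-mean-square."""
    rms = (sum(x * x for x in xs) / len(xs)) ** 0.5
    return [x / rms for x in xs]

def fused_rms_norm(xs):
    """Stand-in for an optimized kernel: same math, fewer passes."""
    inv_rms = (sum(x * x for x in xs) / len(xs)) ** -0.5
    return [x * inv_rms for x in xs]

class TinyModel:
    def __init__(self):
        # a module slot that a patcher can replace in place
        self.norm = naive_rms_norm

def apply_fused_kernels(model):
    """In-place patch, analogous in spirit to Liger's patching helpers."""
    model.norm = fused_rms_norm

model = TinyModel()
out_before = model.norm([3.0, 4.0])
apply_fused_kernels(model)
out_after = model.norm([3.0, 4.0])
# the outputs agree; only the implementation changed
```

The real library applies this idea to operations such as RMSNorm and RoPE in supported Hugging Face architectures, with Triton kernels doing the fused computation on the GPU.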
posted an update 28 days ago
📄 ACL 2024: The Missing Papers

Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.

Some of my favorites:

1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)

2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the context length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, and therefore this paper is highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)

This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)

View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
replied to victor's post 28 days ago

I want to start by expressing my appreciation for the incredible work Hugging Face has done for the open-source community. Your contributions have been invaluable, and I’m grateful for the tools and resources you've provided.

Please take the following as constructive feedback. I wouldn’t have mentioned these points if you hadn’t asked, and I hope they can be seen as suggestions for further improvement.

  • Software quality: When I first started using transformers, I was thoroughly impressed. The basic "hello world" examples work wonderfully, making the initial experience smooth and enjoyable. Nowadays, however, I regularly dive deeper into the library and face challenges such as long-standing bugs, undocumented issues, missing API documentation, and occasionally broken functionality. I am only guessing here, but I think the majority of these repos are written by research engineers or researchers, whose focus might be more on methodological correctness (which is of course crucial as well). That said, it might be helpful to include someone who is stronger in software development and less knowledgeable in ML. Such a person would be the first to complain about "clean code" issues, and also the first to notice problems with the software.

  • Posts: Great feature! However, it could be enhanced by adding basic text formatting options. This would make posts more visually appealing and easier to read.

  • Papers: Restricting this to arXiv is too limiting. While I understand the rationale in terms of implementation effort, if the goal is to be the "Github of ML/AI," it might be worth considering support for at least the high-ranking conferences (or a subset thereof). In many cases, the conference version of a paper supersedes the arXiv version, and this restriction may inadvertently encourage the use of preprints over the finalized versions.

Again, these are just my personal pain points, and I’m sharing them with the intention of helping Hugging Face continue to improve.
