Tree-structured Data Preprocessing/Tokenizing
Hello everyone,
I am currently researching how to fine-tune the chatglm3-6b-128k model on pairs of full text and abstract from scientific publications, so that the model can generate summaries of articles automatically. I have collected the necessary data for this purpose.
However, I've encountered an issue during the data preprocessing stage: since I intend to retain the structural information of the articles, they are formatted as a tree-like structure similar to XML (with tags such as and ). How should I handle this data? Should I ignore these structural details and retain only plain text? Or should I preserve these tags by setting special tokens during tokenization?
Could you please recommend some related research or share your experience with me?
Hi daonanZ,
Thank you for your interest in chatglm3-6b-128k. There is no definitive answer to this question before actual training, but I can try to offer some suggestions.
If you are able to extract important information from the XML in plain text form, I would recommend this approach. If you are unsure about this extraction process and think it might introduce some noise, then it might be more appropriate to keep the XML data format as is.
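As a rough illustration of the plain-text route, here is a minimal sketch using only the Python standard library. The tag names (`article`, `section`, `title`, `p`), the sample document, and the `flatten` helper are all assumptions made up for this example, since the actual tags in your data are not known here; adapt them to your schema. It keeps a light trace of the hierarchy by turning titles into heading-style lines, which is one possible compromise between dropping the structure and keeping raw XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the tag names are assumptions for illustration only.
SAMPLE = """<article>
  <title>An Example Paper</title>
  <section>
    <title>Introduction</title>
    <p>First paragraph.</p>
    <p>Second <i>inline</i> paragraph.</p>
  </section>
</article>"""

# Tags treated as containers whose children are visited recursively (assumed).
CONTAINER_TAGS = {"article", "section"}


def flatten(xml_string, keep_markers=True):
    """Flatten tree-structured XML into plain text, one line per title or leaf block.

    With keep_markers=True, titles are prefixed with '#' repeated by nesting depth,
    so some structural information survives in the plain text.
    """
    root = ET.fromstring(xml_string)
    lines = []

    def walk(node, depth):
        for child in node:
            if child.tag in CONTAINER_TAGS:
                walk(child, depth + 1)
                continue
            # Collect all text inside the element (including inline children)
            # and normalize whitespace.
            text = " ".join("".join(child.itertext()).split())
            if not text:
                continue
            if child.tag == "title" and keep_markers:
                text = "#" * depth + " " + text
            lines.append(text)

    walk(root, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    print(flatten(SAMPLE))
    print("---")
    print(flatten(SAMPLE, keep_markers=False))
```

Comparing a model trained on the `keep_markers=True` output against one trained on the `keep_markers=False` output would be one inexpensive way to check whether the structural hints actually help summarization before investing in special tokens.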
Thank you very much for your kind and prompt response.