Is it appropriate to release a CVX forum conversation dataset for LLM finetuning?

LLMs (large language models) are models like ChatGPT, GPT-4, Gemini, Phi-3, etc. These models do not seem to be very good at answering CVX-related questions, presumably because of the shortage of CVX-related data available for their training/finetuning. So I crawled the posts from this forum, cleaned them myself with the help of an LLM (summarizing the posts and removing people's names), and ended up with a dataset containing cleaned versions of 4780 posts from this forum. The dataset is released here.
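For those curious about the cleaning step, it is roughly the sketch below. This is a minimal illustration rather than the exact script I used; the model name, the prompt wording, and the output file layout are all placeholders:

```python
# Rough sketch of the per-post cleaning step: summarize a raw forum post
# and strip user names with an LLM. Model name and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CLEAN_PROMPT = (
    "Summarize the following CVX forum post, keeping the question, the code, "
    "and the answer, and remove all user names:\n\n"
)

def clean_post(raw_post: str) -> str:
    """Return a summarized, anonymized version of one forum post."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[{"role": "user", "content": CLEAN_PROMPT + raw_post}],
    )
    return resp.choices[0].message.content

def clean_all(raw_posts: list[str], out_path: str = "cvx_forum_cleaned.jsonl") -> None:
    """Clean every crawled post and write the results as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for post in raw_posts:
            f.write(json.dumps({"text": clean_post(post)}, ensure_ascii=False) + "\n")
```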

Please tell me if this is appropriate. Thank you.


As for appropriateness, no one seems to have objected.

As for the technical merit or value of the endeavor, I won’t offer an opinion.


@jackfsuia A good attempt indeed, and a timely and encouraging one as we head into the AI era.

It is interesting. Can you produce a good-quality CVX forum chatbot out of that?

I have trained a CVX chatbot here, using that dataset, the CVX users' guide, and 10000 synthetic LaTeX-CVX code pairs (if anyone is interested, these were produced by a home-made program that generates fake CVX code and then uses ChatGPT to translate it back into LaTeX formulas). The chatbot is far from good: it can only answer basic questions and translate LaTeX formulas into CVX code (and is still wrong sometimes).
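To give an idea of what the synthetic-pair generator does, here is a minimal sketch. The templates, prompt, and model name below are illustrative stand-ins, not the actual generator:

```python
# Illustrative generator for synthetic LaTeX-CVX pairs: emit a random, simple
# CVX snippet from a small template grammar, then ask an LLM to translate it
# back into a LaTeX formula. Templates, prompt, and model name are made up here.
import json
import random
from openai import OpenAI

client = OpenAI()

TEMPLATES = [
    "cvx_begin\n variable x({n})\n minimize( norm(A*x - b, {p}) )\n subject to\n  x >= 0\ncvx_end",
    "cvx_begin\n variable x({n})\n minimize( quad_form(x, P) + c'*x )\n subject to\n  sum(x) == 1\ncvx_end",
]

def random_cvx_code() -> str:
    """Sample one fake CVX program from the templates."""
    template = random.choice(TEMPLATES)
    return template.format(n=random.randint(2, 20), p=random.choice([1, 2, "Inf"]))

def cvx_to_latex(cvx_code: str) -> str:
    """Ask the LLM to write the optimization problem as a LaTeX formula."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Write this CVX program as a LaTeX optimization problem:\n" + cvx_code}],
    )
    return resp.choices[0].message.content

def make_pairs(n_pairs: int, out_path: str = "latex_cvx_pairs.jsonl") -> None:
    """Each training example maps the LaTeX formula (input) to the CVX code (output)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for _ in range(n_pairs):
            code = random_cvx_code()
            f.write(json.dumps({"input": cvx_to_latex(code), "output": code},
                               ensure_ascii=False) + "\n")
```

Going from CVX code to LaTeX (rather than the other way around) is the easier direction for the LLM, which is why the generator works backwards like this.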

I think that to improve it, the forum conversation dataset has to be re-assembled and polished, and my fake-CVX-code generator has to be polished too, to raise the data quality. That will take more time and effort. I think you can train any chatbot these days if you have a large amount of high-quality data (if anyone is interested, that just means a lot of correct input-output pairs; people typically use one part of a sentence/article/conversation as the input and another part as the output).
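As a concrete illustration of what "input-output pairs" means here, this is roughly how a cleaned forum thread can be split into finetuning examples. The field names and file layout are just my own convention, so adapt them to whatever your finetuning tool expects:

```python
# Minimal sketch of turning a cleaned forum thread into finetuning examples:
# the opening question becomes the input and each reply becomes an output.
# The JSONL field names ("question", "replies", "input", "output") are
# assumptions, not anything standard.
import json

def thread_to_pairs(thread: dict) -> list[dict]:
    """Split one conversation into (input, output) training pairs."""
    question = thread["question"]
    return [{"input": question, "output": reply} for reply in thread["replies"]]

def build_dataset(threads_path: str, out_path: str = "cvx_finetune.jsonl") -> None:
    """Read cleaned threads (one JSON object per line) and write training pairs."""
    with open(threads_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            for pair in thread_to_pairs(json.loads(line)):
                fout.write(json.dumps(pair, ensure_ascii=False) + "\n")
```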

By the way, one related piece of information: people have been training chatbots to do optimization modeling for a while; recently I saw a new one here.