LLMs (large language models) are models like ChatGPT, GPT-4, Gemini, Phi-3, etc. But those models don't seem good enough at answering CVX-related questions, likely because of the shortage of CVX-related data available for their training/fine-tuning. So I crawled the posts from this forum and cleaned them myself with the help of an LLM (summarizing the posts and removing people's names), ending up with a dataset of 4780 cleaned posts from this forum. The dataset is released here.
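For anyone curious, the cleaning step looks roughly like the sketch below (not my exact script; the client, model name, and prompt wording here are just placeholders): each crawled thread gets summarized by an LLM, with names stripped out.

```python
# Rough sketch of the cleaning step, assuming the OpenAI Python client.
# The model name and prompt are placeholders, not what I actually used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLEAN_PROMPT = (
    "Summarize the following CVX forum thread as a question-answer pair. "
    "Remove all usernames and personal names.\n\n{thread}"
)

def clean_post(raw_thread: str) -> str:
    """Return a summarized, anonymized version of one forum thread."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # whichever model you have access to
        messages=[{"role": "user", "content": CLEAN_PROMPT.format(thread=raw_thread)}],
    )
    return response.choices[0].message.content
```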
I have trained a CVX chatbot here, using that dataset, the CVX users' guide, and 10,000 synthetic LaTeX-CVX code pairs (if anyone is interested, these were produced by a home-made program that generates fake CVX code and uses ChatGPT to translate it back into LaTeX formulas). The chatbot is far from good: it can only answer basic questions and translate LaTeX formulas into CVX code (still wrong sometimes).
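The idea behind the synthetic pairs is roughly the sketch below (my real generator is more elaborate): randomly assemble a small, syntactically valid CVX program from templates, then ask an LLM to write the corresponding LaTeX formula. The `translate_to_latex` argument here is hypothetical; in practice it wraps a ChatGPT call like the one in the cleaning sketch above.

```python
# Minimal sketch of the fake-CVX-code generator used for backtranslation.
import random

def random_cvx_program() -> str:
    """Assemble a toy CVX (MATLAB) snippet from a few templates."""
    n = random.randint(2, 10)
    objective = random.choice([
        "minimize( norm(A*x - b, 2) )",
        "minimize( sum_square(A*x - b) )",
        "maximize( sum(log(x)) )",
    ])
    constraint = random.choice(["x >= 0", "norm(x, 1) <= 1", "sum(x) == 1"])
    return "\n".join([
        "cvx_begin",
        f"    variable x({n})",
        f"    {objective}",
        "    subject to",
        f"        {constraint}",
        "cvx_end",
    ])

def make_pair(translate_to_latex) -> dict:
    """Build one (LaTeX, CVX) training pair; the LaTeX side comes from an LLM."""
    cvx_code = random_cvx_program()
    return {"latex": translate_to_latex(cvx_code), "cvx": cvx_code}
```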
To improve it, I think the forum conversation dataset has to be re-assembled and polished, and my fake-CVX-code generator needs more work too, all to raise the data quality. That will take more time and effort. I think you can train any chatbot these days if you have a large amount of high-quality data (if anyone is interested, that just means a lot of correct input-output pairs; people usually take one part of a sentence/article/conversation as the input and another part as the output).
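To make "input-output pairs" concrete, here are a couple of made-up examples of what such pairs could look like for this dataset:

```python
# Illustrative (made-up) training pairs: a prompt plus the desired completion.
training_pairs = [
    {
        "input": "Write CVX code for: minimize ||Ax - b||_2 subject to x >= 0",
        "output": "cvx_begin\n    variable x(n)\n    minimize( norm(A*x - b, 2) )\n"
                  "    subject to\n        x >= 0\ncvx_end",
    },
    {
        "input": "Why does CVX reject log(x)*y when x and y are variables?",
        "output": "The product of two variables is not recognized as convex; "
                  "CVX only accepts expressions that follow its DCP ruleset.",
    },
]
```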
By the way, one related note: people have been training chatbots to do optimization modeling for a while; recently I saw a new one here.