The Background

Over the last few weeks, Caicai and I have been working on our product, Haye AI, an in-context AI assistant.

We work on it in our spare time, chasing promising ideas and state-of-the-art technologies to make it better.

A few days ago, Meta released Llama 3, and it really changed things.

The New Open Source Model: Llama 3

Llama 3 is impressive in several respects:

  • it can respond in multiple languages, such as Chinese (though it sometimes answers in Pinyin, and occasionally leaves individual words in English)
  • its overall performance is much better than Llama 2's, approaching GPT-4
  • some online API providers offer Llama 3 at extremely low cost with fast response times, which makes it the most cost-effective AI model available right now

We used to rely on gpt-3.5-turbo as the main model for most tasks, with some predefined presets. I spent hours and hours tweaking prompts in the playground, and the results were not always good. When I switched the model to gpt-4, the results were much better, but we couldn't afford it: the cost is roughly 10x to 15x that of gpt-3.5-turbo.

So we were really happy to see Llama 3, and we are now working on integrating it into our product.


There are several problems we are facing:

  • Llama 3 does not natively support function calling, which is a key feature for some of our tasks
  • there is still no reliable LLM provider offering a Llama 3 API
  • different LLM providers have different APIs; even though most are advertised as "OpenAI compatible," they are compatible in different ways
    • Groq does not support function calling with the streaming API, so I must manually split tasks depending on whether they require function calling
    • DeepInfra, Fireworks, and Together simply ignore the function calling parameters in the request
    • OpenRouter supports function calling with streaming, but it returns the tool call in the content field rather than the expected tool_calls field, which does not follow the OpenAI API standard
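To cope with these quirks, the request path has to branch per provider. Here is a minimal sketch of that routing logic; the capability table is hypothetical (names from the providers above, flags reflecting the behavior I observed), not an official feature matrix:

```python
# Hypothetical capability flags per provider, based on observed behavior.
PROVIDER_CAPS = {
    "groq":       {"tools": True,  "tools_with_stream": False},
    "deepinfra":  {"tools": False, "tools_with_stream": False},
    "openrouter": {"tools": True,  "tools_with_stream": True},  # nonstandard field, though
}

def plan_request(provider: str, needs_tools: bool) -> dict:
    """Decide whether a chat completion request can stream,
    given the provider's function-calling quirks."""
    caps = PROVIDER_CAPS[provider]
    if needs_tools and not caps["tools"]:
        # e.g. DeepInfra silently drops the tools parameter
        raise ValueError(f"{provider} ignores the tools parameter")
    # Stream whenever possible; fall back to a blocking call
    # when the provider can't combine tools with streaming.
    stream = (not needs_tools) or caps["tools_with_stream"]
    return {"stream": stream, "use_tools": needs_tools}
```

With a table like this, a task that needs tools on Groq is routed to a non-streaming call, while plain chat requests keep streaming everywhere.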

We’re still waiting for Llama 3 to become available on Azure or another big cloud provider whose reliability I can trust. In the meantime, I’d like to work on the function calling feature, probably by building a simple LLM-agnostic “function calling” adapter based on “Reason - Act” (ReAct) prompting.
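The adapter could look roughly like this: describe the tools in the system prompt, then parse an "Action:" line out of the plain-text completion. This is only a sketch of the idea; the prompt wording and the Action syntax are my own assumptions, not a fixed standard:

```python
import json
import re

# Prompt template for a ReAct-style, model-agnostic tool-calling adapter.
# The exact wording and the Action line format are assumptions.
REACT_SYSTEM_PROMPT = """You can call tools. Think step by step, then either
answer directly or emit exactly one line of the form:
Action: {{"name": "<tool_name>", "arguments": {{...}}}}
Available tools:
{tools}"""

# Matches a single line like: Action: {"name": "...", "arguments": {...}}
ACTION_RE = re.compile(r'^Action:\s*(\{.*\})\s*$', re.MULTILINE)

def build_system_prompt(tools: list[dict]) -> str:
    """Render the tool schemas into the system prompt."""
    return REACT_SYSTEM_PROMPT.format(tools=json.dumps(tools, indent=2))

def parse_tool_call(completion: str):
    """Extract a (name, arguments) tool call from the model's plain-text
    output, or return None if the model answered directly."""
    m = ACTION_RE.search(completion)
    if not m:
        return None
    call = json.loads(m.group(1))
    return call["name"], call["arguments"]
```

Because the tool call lives in plain text, this works the same way against any provider, whether or not it honors the tools parameter.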