5 parameter optimizations to improve AI agent performance


In the fast-evolving AI landscape, AI agents have emerged as a powerful tool to enhance customer service and streamline business operations. However, harnessing the full potential of these autonomous, intelligent systems requires more than implementation; it demands thoughtful configuration of the AI parameters that shape their performance and the user experience.

In this article, we delve into five essential AI parameter configurations that can elevate your AI agent performance and user experience. Whether you're building your first AI agent or enhancing the capabilities of an entire AI workforce, mastering these fundamental configurations is crucial to getting the most out of your AI system.


1. Temperature

Temperature is an AI parameter used in text generation that controls the randomness of an AI model's responses. In large language models (LLMs) like ChatGPT, the temperature determines how surprising and diverse the generated text can be.

Here's how it works:

  • Low temperature: When the temperature is low (close to 0), the AI model tends to produce more predictable, "safe" responses. It is more likely to generate the most probable word at each step, leading to more consistent and coherent (if sometimes repetitive) output.

  • High temperature: When the temperature is high (e.g., above 1), the model becomes more creative and unpredictable. It can generate a wider range of responses, including more unusual or unexpected word choices. This can sometimes lead to less coherent but more diverse outputs.


In essence, adjusting the temperature parameter lets users balance safe, predictable responses against more creative, diverse ones from their AI agents. The following examples show responses to the same prompt at low and high temperature.

Example AI prompt: "What is your favorite place to visit?"

Low temperature response (e.g., 0.5): "My favorite place to visit is the beach. I love the sound of waves crashing and the feel of sand between my toes. It's so peaceful and relaxing."

At a low temperature, the model generates a response that is coherent and focused. It tends to choose the most probable words, resulting in a straightforward and descriptive answer.

High temperature response (e.g., 1.0): "I adore exploring new places! The beach is definitely a top pick for me. The salty breeze, the sun warming my skin, it's like a dream. But then there's also the allure of bustling cities, the history of ancient ruins, or the serenity of mountain peaks."

With a higher temperature, the model introduces more variability and spontaneity into the response of the AI agent. The language used is more diverse and imaginative, resulting in a response that feels more exploratory and less predictable compared to the low-temperature example.

In both examples, the temperature parameter influences how the AI agent generates responses to reflect either a more focused and predictable style (low temperature) or a more varied and imaginative style (high temperature). Adjusting the temperature allows users to control the level of creativity and diversity in the generated text based on their preferences or the specific application context.
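
If you prefer to see the mechanics, here is a minimal, illustrative Python sketch of how temperature scaling is typically applied: the logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The function name and toy values are our own; this is not any particular vendor's implementation.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one token id from raw logits after temperature scaling.

    Dividing the logits by the temperature before the softmax flattens
    (T > 1) or sharpens (T < 1) the distribution, which is why a high
    temperature yields more varied output and a low temperature yields
    more predictable output.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy example: three candidate tokens with different raw scores.
logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=1.5))  # choices spread out more
```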

2. Top p

Top p (also known as nucleus sampling) is a technique used in AI models to enhance the diversity and relevance of the generated text. Here’s how it works:

  • Top p sampling: Instead of sampling from the entire probability distribution of possible next words, top p sampling narrows the choices to a nucleus of the most probable words: the smallest set whose cumulative probability reaches p. Only these words are considered for the next token.

  • Dynamic selection: The actual set of words considered varies dynamically based on the cumulative probability mass. This allows for a flexible approach where more or fewer words can be included in the sampling pool depending on their probabilities.

Top p sampling helps in generating more diverse and contextually relevant responses compared to simple random sampling or greedy sampling. It encourages the model to explore different word choices while still maintaining coherence and relevance.

In practice, adjusting the value of p allows users to control the diversity of generated responses. A lower p will lead to more conservative, predictable responses, while a higher p can result in more diverse and surprising outputs. The examples below illustrate the difference.


Example AI prompt: "What are your plans for the weekend?"

Low top p response (e.g., 0.1): "I'm planning to relax at home. Maybe watch some movies or read a book. It's nice to unwind after a busy week."

With a low top p setting, sampling is restricted to only the most likely words or tokens, in this case those within a cumulative probability mass of 0.1. The response is more conservative and tends to stick to safer, more predictable choices.

High top p response (e.g., 0.9): "I haven't decided yet! Maybe I'll go hiking if the weather's nice, or try out that new restaurant downtown. Or perhaps just stay in and catch up on my reading list. So many options!"

With a higher top p setting, the model expands the sampling pool to include a broader range of words or tokens whose cumulative probability mass can reach up to 0.9. The model explores multiple possibilities for weekend plans, mentioning different activities and options. The response feels more spontaneous and less predictable compared to the low top p example.

The top p parameter (nucleus sampling) influences the diversity and variability of responses generated by an AI agent. A low top p setting limits the sampling to the most likely tokens, resulting in more conservative and predictable answers. In contrast, a high top p setting allows for a wider range of token choices, leading to more diverse and potentially more creative responses. Adjusting the top p parameter allows users to control the balance between generating coherent responses and exploring more varied outputs based on their preferences or specific application needs.
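
For a more concrete picture, here is a small, illustrative Python sketch of nucleus sampling over a toy probability distribution. The helper name and numbers are hypothetical; production implementations work on full vocabulary-sized tensors inside the decoding loop.

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (the "nucleus"), renormalize, and sample from it."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]              # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # tokens needed to reach p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)

# Toy distribution over five tokens.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_p_sample(probs, p=0.1))  # only the single most likely token survives
print(top_p_sample(probs, p=0.9))  # the four most likely tokens stay in play
```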


3. Top k

Top k is an AI parameter that limits the number of tokens a language model considers when generating text. This sampling strategy restricts the sampling pool to the k most likely tokens (words or subwords) at each step of text generation. By limiting the choices to a smaller set of highly likely tokens, it controls the diversity and coherence of the generated output.

Here’s how top k sampling works:

  1. Probability distribution: Before generating each token (word or subword) in a sequence, the model calculates the probability distribution over its entire vocabulary.

  2. Top k selection: Instead of sampling from the entire probability distribution, top k sampling narrows down the choices to the top k tokens with the highest probabilities. These tokens are considered for the next token in the sequence.

  3. Sampling strategy: From the top k tokens, one token is randomly sampled according to their probabilities. This introduces randomness while still ensuring that the model predominantly selects from the most likely tokens.

  4. Adjustable parameter: The value of k is a hyper-parameter that can be adjusted by the user. A smaller value of k leads to a more constrained sampling process, favoring more predictable and safe responses. Conversely, a larger value of k allows for more diverse and potentially unexpected outputs.

Top k sampling is commonly used in conjunction with other sampling strategies like temperature scaling and top p sampling (nucleus sampling) to control the diversity and quality of generated text in natural language generation tasks such as chatbots, dialogue systems, and text completion tasks. These strategies help balance between generating coherent, contextually appropriate responses and introducing variability and creativity into text generated by an AI agent.

Let’s see an example of high and low top k.

Example AI prompt: What's your favorite book?

Low top k response (e.g., 5): "Harry Potter. What about you?"

In this case, with a low top k setting, the model quickly settles on a single book title and transitions to asking for the user's opinion, providing a more direct and focused response.

High top k response (e.g., 50): "That's a tough one! I have so many favorites, like 'The Lord of the Rings,' 'To Kill a Mockingbird,' and 'Pride and Prejudice.' Each one offers a unique perspective and captivating storyline. Have you read any of those?"

With a high top k setting, the model lists multiple books, showcasing its ability to provide diverse recommendations and engage the user in further discussion.

Adjusting the top k parameter can significantly influence the style and diversity of responses generated by AI agents that run on LLMs, catering to different conversational styles or user preferences.
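
The mechanism itself is simple. Below is a minimal, illustrative Python sketch of a top k filter over a toy distribution; the function name and values are hypothetical, and real decoders apply this to the full vocabulary at every generation step, often combined with temperature or top p.

```python
import numpy as np

def top_k_sample(probs, k=5, rng=None):
    """Restrict sampling to the k most likely tokens, then renormalize."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    top = np.argsort(probs)[::-1][:k]            # indices of the k best tokens
    top_probs = probs[top] / probs[top].sum()
    return rng.choice(top, p=top_probs)

# Toy distribution over six tokens.
probs = [0.4, 0.25, 0.15, 0.1, 0.06, 0.04]
print(top_k_sample(probs, k=1))  # greedy behavior: always token 0
print(top_k_sample(probs, k=5))  # any of the five most likely tokens may appear
```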

4. Frequency penalty

Frequency penalty is a technique that encourages diversity in generated responses by penalizing the repetition of words or tokens within a certain context or window. This penalty helps mitigate the tendency of models to produce repetitive or redundant outputs, thereby promoting more varied and interesting responses.


Here’s how frequency penalty works:

  1. Tracking tokens: The model keeps track of the tokens (words or subwords) it has generated so far, typically within a recent window of text.

  2. Penalty calculation: A penalty score is calculated based on how frequently tokens are repeated within this window. Tokens that appear more frequently within the window receive a higher penalty score.

  3. Influence on sampling: During the generation process, tokens that have a higher penalty score are less likely to be chosen by the model for the next output token. This encourages the model to explore less frequent or previously unused tokens, thereby increasing the diversity of responses.

  4. Control parameter: Users can typically adjust the strength of the frequency penalty by setting a parameter that determines how heavily repeated tokens are penalized. Higher penalties lead to more diverse but potentially less coherent responses, while lower penalties maintain coherence but may result in more repetitive outputs.

Frequency penalty is one of several techniques used alongside temperature and top p sampling to improve the quality and diversity of responses generated by AI agents that run on LLMs or even small language models (SLMs), especially in conversational settings where natural and varied language output is desired.
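
As a rough sketch of the idea, the snippet below applies a count-proportional penalty to the logits of tokens that have already been generated. It mirrors the count-based scheme documented by several LLM providers, but the function name, values, and the choice to count over the whole generated sequence (rather than a window) are our own simplifications.

```python
from collections import Counter
import numpy as np

def apply_frequency_penalty(logits, generated_tokens, penalty=0.5):
    """Subtract penalty * count from the logit of every token already
    generated, so heavily repeated tokens are pushed down the hardest."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    for token_id, count in Counter(generated_tokens).items():
        logits[token_id] -= penalty * count
    return logits

# Token 2 has already appeared three times, so its logit drops the most.
logits = [1.2, 0.8, 2.0, 0.5]
print(apply_frequency_penalty(logits, generated_tokens=[2, 2, 2, 1], penalty=0.5))
# approx. [1.2, 0.3, 0.5, 0.5]
```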

5. Presence penalty

Presence penalty is an LLM setting that encourages responses to cover a broader range of topics or aspects within a given context. It works by discouraging the model from reusing words or phrases that have already appeared in its generated text. This penalty helps mitigate the tendency of AI models to focus too narrowly on specific topics or themes, thereby promoting more comprehensive and balanced responses from AI agents.

Here’s how presence penalty works:

  1. Tracking presence: As the model generates a response to the input context or AI prompt, it tracks which tokens (words or subwords) have already appeared at least once in the output so far.

  2. Penalty calculation: A fixed penalty is applied to every token that is already present, regardless of how many times it has appeared. This is what distinguishes presence penalty from frequency penalty, which grows with the repeat count.

  3. Influence on generation: Because previously used tokens become slightly less likely to be chosen again, the model is nudged toward introducing new words, phrases, and topics as the response unfolds, improving overall topic coverage.

  4. Adjustable parameter: Users can typically adjust the strength of the presence penalty by setting a parameter that determines how heavily it influences generation. Higher values encourage the model to branch out into more diverse topics, while lower values keep responses more focused but potentially less comprehensive.

Presence penalty is particularly useful in applications where generating well-rounded and informative responses is important, such as in conversational AI systems, AI customer service agents, or content generation tools. It helps ensure that responses are not only coherent but also sufficiently diverse and informative across different aspects of the input context.
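
In many common implementations, the presence penalty is a flat, one-time deduction applied to any token that has already appeared, which is what nudges the model toward new words and topics. The sketch below illustrates that idea with hypothetical names and values; it is a simplification, not a specific vendor's code.

```python
import numpy as np

def apply_presence_penalty(logits, generated_tokens, penalty=0.6):
    """Subtract a flat penalty from the logit of every token that has
    appeared at least once, regardless of how many times. Unlike a
    frequency penalty, the deduction does not grow with the repeat
    count, so it mainly encourages branching into new words and topics."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    for token_id in set(generated_tokens):
        logits[token_id] -= penalty
    return logits

# Tokens 1 and 2 have both appeared before, so each loses the same amount.
logits = [1.2, 0.8, 2.0, 0.5]
print(apply_presence_penalty(logits, generated_tokens=[2, 2, 2, 1], penalty=0.6))
# approx. [1.2, 0.2, 1.4, 0.5]
```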


Optimizing AI agents with advanced configurations and AI parameters

Mastering the configuration of AI agents is not just a technical endeavor but a strategic imperative for businesses aiming to excel at customer engagement, operational efficiency, and AI-powered customer service. This kind of AI agent customization lets a business tailor interactions to the specific needs and preferences of target users or customers, ensuring accurate, relevant responses and the best possible customer experience.
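
To make this concrete, here is a minimal sketch of how several of these parameters can be set on a single request using the OpenAI Python SDK. The model name and values are placeholders, and not every provider exposes every knob: top k, for example, is offered by some APIs (such as Anthropic's) but not by OpenAI's chat completions endpoint.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a concise customer support agent."},
        {"role": "user", "content": "What's the status of my order?"},
    ],
    temperature=0.4,        # keep answers focused and predictable
    top_p=0.9,              # sample from the 90% probability nucleus
    frequency_penalty=0.3,  # discourage repeating the same phrasing
    presence_penalty=0.2,   # gently encourage introducing new information
)

print(response.choices[0].message.content)
```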

If you’re looking to build custom AI agents for any use case, Sendbird can help. Our robust AI agent platform makes it easy to build and modify AI agents for peak accuracy and relevance, all on a foundation of enterprise-grade AI infrastructure that ensures optimal performance with unmatched adaptability, scalability, and security.

If you want to learn more about the future of AI, you might enjoy these related resources:
