phasellm.eval#

Support for LLM evaluation.

Module Contents#

Classes#

BinaryPreference

Tracks a prompt, prompt variables, responses, and the calculated preference.

EvaluationStream

Tracks human evaluation on the command line and records results.

HumanEvaluatorCommandLine

Presents an objective, prompt, and two potential responses and has a human choose between the two.

GPTEvaluator

Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.

Functions#

simulate_n_chat_simulations(→ List[str])

Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.

phasellm.eval.simulate_n_chat_simulations(chatbot: phasellm.llms.ChatBot, n: int, out_path_excel: str | None = None) List[str]#

Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.

Parameters:
  • chatbot – the chat sequence to rerun. The last message will be resent.

  • n – number of times to run the simulation.

  • out_path_excel – if provided, the output will also be written to an Excel file.

Returns:

A list of messages representing the responses in the chat.
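
For illustration, a minimal usage sketch. The OpenAIGPTWrapper setup, the placeholder API key, and the initial chat message are assumptions for the example; any phasellm.llms.ChatBot with at least one sent message should work the same way.

from phasellm.llms import OpenAIGPTWrapper, ChatBot
from phasellm.eval import simulate_n_chat_simulations

# Build a chat whose last message will be resent n times.
llm = OpenAIGPTWrapper("sk-...", model="gpt-3.5-turbo")  # placeholder API key
chatbot = ChatBot(llm)
chatbot.chat("Suggest a name for a coffee shop that doubles as a bookstore.")

# Rerun the last message 5 times and also write the responses to Excel.
responses = simulate_n_chat_simulations(chatbot, 5, out_path_excel="responses.xlsx")
for i, response in enumerate(responses, start=1):
    print(f"Run {i}: {response}")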

class phasellm.eval.BinaryPreference(prompt: str, prompt_vars: str, response1: str, response2: str)#

Tracks a prompt, prompt variables, responses, and the calculated preference.

Parameters:
  • prompt – The prompt.

  • prompt_vars – The variables to use in the prompt.

  • response1 – The first response.

  • response2 – The second response.

__repr__()#

Return repr(self).

set_preference(pref)#

Set the recorded preference for this comparison.

get_preference()#

Get the recorded preference for this comparison.
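
A minimal sketch of recording a preference by hand; the prompt, variables, and responses are placeholders, and storing 1 or 2 mirrors the convention used by the evaluators below.

from phasellm.eval import BinaryPreference

bp = BinaryPreference(
    prompt="Summarize the following text: {text}",
    prompt_vars="{'text': '...'}",
    response1="Summary A",
    response2="Summary B",
)

bp.set_preference(2)        # record that the second response was preferred
print(bp.get_preference())  # 2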

class phasellm.eval.EvaluationStream(objective, prompt, models)#

Tracks human evaluation on the command line and records results.

Parameters:
  • objective – what you are trying to do.

  • prompt – the prompt you are using. Could be a summary thereof, too. We do not actively use this prompt in generating data for evaluation.

  • models – an array of two models. These can be referenced later if need be, but are not necessary for running the evaluation workflow.

__repr__()#

Return repr(self).

evaluate(response1, response2)#

Shows both responses for review on the command line and records the result.
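
A sketch of a command-line evaluation loop, assuming you already have paired responses from two models; the objective, prompt, model labels, and responses are placeholders. Each evaluate() call is interactive and waits for the reviewer's choice.

from phasellm.eval import EvaluationStream

# Placeholder outputs; in practice these come from two different models.
responses_model_1 = ["Draft A1", "Draft A2"]
responses_model_2 = ["Draft B1", "Draft B2"]

es = EvaluationStream(
    objective="Pick the clearer product description.",
    prompt="Write a one-sentence description for {product}.",
    models=["gpt-3.5-turbo", "gpt-4"],  # kept for reference only
)

for r1, r2 in zip(responses_model_1, responses_model_2):
    es.evaluate(r1, r2)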

class phasellm.eval.HumanEvaluatorCommandLine#

Presents an objective, prompt, and two potential responses and has a human choose between the two.

__repr__()#

Return repr(self).

choose(objective, prompt, response1, response2)#

Presents the objective, prompt, and both responses on the command line and asks a human to choose between the two.
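
A sketch of a single human comparison; the objective, prompt, and responses are placeholders, and interpreting the return value as 1 or 2 (as with GPTEvaluator.choose() below) is an assumption.

from phasellm.eval import HumanEvaluatorCommandLine

evaluator = HumanEvaluatorCommandLine()

# Shows the objective, prompt, and both responses, then waits for input.
choice = evaluator.choose(
    objective="Pick the friendlier support reply.",
    prompt="Respond to a customer asking about a late delivery.",
    response1="We're sorry for the delay; your order ships tomorrow.",
    response2="Delays happen. It will arrive when it arrives.",
)
print(choice)  # assumed to be 1 or 2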

class phasellm.eval.GPTEvaluator(apikey, model='gpt-3.5-turbo')#

Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.

Parameters:
  • apikey – the OpenAI API key.

  • model – the model to use. Defaults to GPT-3.5 Turbo.

__repr__()#

Return repr(self).

choose(objective, prompt, response1, response2)#

Presents the objective of the evaluation task, a prompt, and then two responses. GPT-3.5/GPT-4 chooses the preferred response.

Parameters:
  • objective – the objective of the modeling task.

  • prompt – the prompt to use.

  • response1 – the first response.

  • response2 – the second response.

Returns:

1 if response1 is preferred, 2 if response2 is preferred.
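
A sketch of automated preference judging with the documented return convention; the API key, objective, prompt, and responses are placeholders.

from phasellm.eval import GPTEvaluator

evaluator = GPTEvaluator(apikey="sk-...", model="gpt-4")  # placeholder key

preference = evaluator.choose(
    objective="Choose the more faithful one-sentence summary.",
    prompt="Summarize the following article in one sentence: ...",
    response1="The article argues that remote work improves productivity.",
    response2="The article is about work.",
)

if preference == 1:
    print("response1 preferred")
else:
    print("response2 preferred")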