phasellm.eval#

Support for LLM evaluation.

Module Contents#

Classes#

BinaryPreference

Tracks a prompt, prompt variables, responses, and the calculated preference.

EvaluationStream

Tracks human evaluation on the command line and records results.

HumanEvaluatorCommandLine

Presents an objective, prompt, and two potential responses and has a human choose between the two.

GPTEvaluator

Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.

Functions#

simulate_n_chat_simulations(→ List[str])

Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.

phasellm.eval.simulate_n_chat_simulations(chatbot: phasellm.llms.ChatBot, n: int, out_path_excel: str | None = None) List[str]#

Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.

Parameters:
  • chatbot – the chat sequence to rerun. The last message will be resent.

  • n – number of times to run the simulation.

  • out_path_excel – if provided, the output will also be written to an Excel file.

Returns:

A list of messages representing the responses in the chat.
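
For illustration, a minimal usage sketch. The OpenAIGPTWrapper setup, the placeholder API key, and the initial chat message are assumptions for the example; any phasellm.llms.ChatBot with at least one sent message should work the same way.

from phasellm.llms import OpenAIGPTWrapper, ChatBot
from phasellm.eval import simulate_n_chat_simulations

# Build a chat whose last message will be resent n times.
llm = OpenAIGPTWrapper("sk-...", model="gpt-3.5-turbo")  # placeholder API key
chatbot = ChatBot(llm)
chatbot.chat("Suggest a name for a coffee shop that doubles as a bookstore.")

# Rerun the last message 5 times and also write the responses to Excel.
responses = simulate_n_chat_simulations(chatbot, 5, out_path_excel="responses.xlsx")
for i, response in enumerate(responses, start=1):
    print(f"Run {i}: {response}")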

class phasellm.eval.BinaryPreference(prompt: str, prompt_vars: str, response1: str, response2: str)#

Tracks a prompt, prompt variables, responses, and the calculated preference.

Parameters:
  • prompt – The prompt.

  • prompt_vars – The variables to use in the prompt.

  • response1 – The first response.

  • response2 – The second response.

__repr__()#

Return repr(self).

set_preference(pref)#

Set the recorded preference for this comparison.

get_preference()#

Get the recorded preference for this comparison.
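
A minimal sketch of recording a preference by hand; the prompt, variables, and responses are placeholders, and storing 1 or 2 mirrors the convention used by the evaluators below.

from phasellm.eval import BinaryPreference

bp = BinaryPreference(
    prompt="Summarize the following text: {text}",
    prompt_vars="{'text': '...'}",
    response1="Summary A",
    response2="Summary B",
)

bp.set_preference(2)        # record that the second response was preferred
print(bp.get_preference())  # 2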

class phasellm.eval.EvaluationStream(objective, prompt, models)#

Tracks human evaluation on the command line and records results.

Parameters:
  • objective – what you are trying to do.

  • prompt – the prompt you are using. Could be a summary thereof, too. We do not actively use this prompt in generating data for evaluation.

  • models – an array of two models. These can be referenced later if need be, but are not necessary for running the evaluation workflow.

__repr__()#

Return repr(self).

evaluate(response1, response2)#

Shows both responses for review on the command line and records the result.
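
A sketch of a command-line evaluation loop, assuming you already have paired responses from two models; the objective, prompt, model labels, and responses are placeholders. Each evaluate() call is interactive and waits for the reviewer's choice.

from phasellm.eval import EvaluationStream

# Placeholder outputs; in practice these come from two different models.
responses_model_1 = ["Draft A1", "Draft A2"]
responses_model_2 = ["Draft B1", "Draft B2"]

es = EvaluationStream(
    objective="Pick the clearer product description.",
    prompt="Write a one-sentence description for {product}.",
    models=["gpt-3.5-turbo", "gpt-4"],  # kept for reference only
)

for r1, r2 in zip(responses_model_1, responses_model_2):
    es.evaluate(r1, r2)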

class phasellm.eval.HumanEvaluatorCommandLine#

Presents an objective, prompt, and two potential responses and has a human choose between the two.

__repr__()#

Return repr(self).

choose(objective, prompt, response1, response2)#

Presents the objective, prompt, and both responses on the command line and asks a human to choose between the two.
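
A sketch of a single human comparison; the objective, prompt, and responses are placeholders, and interpreting the return value as 1 or 2 (as with GPTEvaluator.choose() below) is an assumption.

from phasellm.eval import HumanEvaluatorCommandLine

evaluator = HumanEvaluatorCommandLine()

# Shows the objective, prompt, and both responses, then waits for input.
choice = evaluator.choose(
    objective="Pick the friendlier support reply.",
    prompt="Respond to a customer asking about a late delivery.",
    response1="We're sorry for the delay; your order ships tomorrow.",
    response2="Delays happen. It will arrive when it arrives.",
)
print(choice)  # assumed to be 1 or 2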

class phasellm.eval.GPTEvaluator(apikey, model='gpt-3.5-turbo')#

Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.

Parameters:
  • apikey – the OpenAI API key.

  • model – the model to use. Defaults to GPT-3.5 Turbo.

__repr__()#

Return repr(self).

choose(objective, prompt, response1, response2)#

Presents the objective of the evaluation task, a prompt, and then two responses. GPT-3.5/GPT-4 chooses the preferred response.

Parameters:
  • objective – the objective of the modeling task.

  • prompt – the prompt to use.

  • response1 – the first response.

  • response2 – the second response.

Returns:

1 if response1 is preferred, 2 if response2 is preferred.
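
A sketch of automated preference judging with the documented return convention; the API key, objective, prompt, and responses are placeholders.

from phasellm.eval import GPTEvaluator

evaluator = GPTEvaluator(apikey="sk-...", model="gpt-4")  # placeholder key

preference = evaluator.choose(
    objective="Choose the more faithful one-sentence summary.",
    prompt="Summarize the following article in one sentence: ...",
    response1="The article argues that remote work improves productivity.",
    response2="The article is about work.",
)

if preference == 1:
    print("response1 preferred")
else:
    print("response2 preferred")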