phasellm.eval#
Support for LLM evaluation.
Module Contents#
Classes#
BinaryPreference | Tracks a prompt, prompt variables, responses, and the calculated preference.
EvaluationStream | Tracks human evaluation on the command line and records results.
HumanEvaluatorCommandLine | Presents an objective, prompt, and two potential responses and has a human choose between the two.
GPTEvaluator | Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.
Functions#
simulate_n_chat_simulations | Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.
- phasellm.eval.simulate_n_chat_simulations(chatbot: phasellm.llms.ChatBot, n: int, out_path_excel: str | None = None) List[str]#
Reruns a chat message n times, returning a list of responses. Note that this will query an external API n times, so please be careful with costs.
- Parameters:
chatbot – the chat sequence to rerun. The last message will be resent.
n – number of times to run the simulation.
out_path_excel – if provided, the output will also be written to an Excel file at this path.
- Returns:
A list of messages representing the responses in the chat.
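The rerun loop described above can be sketched without touching a real API. This is a minimal illustration, not PhaseLLM's implementation: `FakeChatBot`, its `resend()` method, and the helper `rerun_n_times` are hypothetical names introduced here to stand in for a `phasellm.llms.ChatBot` and the library function.

```python
from typing import List


class FakeChatBot:
    """Hypothetical stand-in for a chatbot; makes no network calls."""

    def __init__(self) -> None:
        self.calls = 0

    def resend(self) -> str:
        # Assumed method name: resends the last message and returns the reply.
        self.calls += 1
        return f"response {self.calls}"


def rerun_n_times(chatbot: FakeChatBot, n: int) -> List[str]:
    """Resend the last message n times and collect each response."""
    responses = []
    for _ in range(n):
        responses.append(chatbot.resend())
    return responses


bot = FakeChatBot()
responses = rerun_n_times(bot, 3)
print(responses)
```

With a real chatbot, each iteration would be a separate request to the external API, so it is worth keeping n small while testing.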
- class phasellm.eval.BinaryPreference(prompt: str, prompt_vars: str, response1: str, response2: str)#
Tracks a prompt, prompt variables, responses, and the calculated preference.
- Parameters:
prompt – The prompt.
prompt_vars – The variables to use in the prompt.
response1 – The first response.
response2 – The second response.
- __repr__()#
Return repr(self).
- set_preference(pref)#
Set the preference of the class.
- get_preference()#
Get the preference of the class.
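To show how such a record might be used, the sketch below mirrors the documented constructor and the two accessor methods. `PreferenceRecord` is a hypothetical stand-in, not the library class; it assumes the preference is unset until `set_preference` is called and that it follows the same 1/2 convention that `GPTEvaluator.choose` returns.

```python
class PreferenceRecord:
    """Illustrative mirror of the documented BinaryPreference interface."""

    def __init__(self, prompt, prompt_vars, response1, response2):
        self.prompt = prompt
        self.prompt_vars = prompt_vars
        self.response1 = response1
        self.response2 = response2
        # Assumption: no preference is recorded until set_preference is called.
        self.preference = None

    def set_preference(self, pref):
        self.preference = pref

    def get_preference(self):
        return self.preference


record = PreferenceRecord(
    prompt="Summarize: {text}",
    prompt_vars="text",
    response1="A short summary.",
    response2="A longer, more detailed summary.",
)
record.set_preference(2)  # assumed convention: 2 means response2 is preferred
print(record.get_preference())
```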
- class phasellm.eval.EvaluationStream(objective, prompt, models)#
Tracks human evaluation on the command line and records results.
- Parameters:
objective – what you are trying to do.
prompt – the prompt you are using. Could be a summary thereof, too. We do not actively use this prompt in generating data for evaluation.
models – an array of two models. These can be referenced later if need be, but are not necessary for running the evaluation workflow.
- __repr__()#
Return repr(self).
- evaluate(response1, response2)#
Shows both sets of options for review and tracks the result.
- class phasellm.eval.HumanEvaluatorCommandLine#
Presents an objective, prompt, and two potential responses and has a human choose between the two.
- __repr__()#
Return repr(self).
- choose(objective, prompt, response1, response2)#
- class phasellm.eval.GPTEvaluator(apikey, model='gpt-3.5-turbo')#
Passes two model outputs to GPT-3.5 or GPT-4 and has it decide which is the better output.
- Parameters:
apikey – the OpenAI API key.
model – the model to use. Defaults to GPT-3.5 Turbo.
- __repr__()#
Return repr(self).
- choose(objective, prompt, response1, response2)#
Presents the objective of the evaluation task, a prompt, and then two responses. GPT-3.5/GPT-4 chooses the preference.
- Parameters:
objective – the objective of the modeling task.
prompt – the prompt to use.
response1 – the first response.
response2 – the second response.
- Returns:
1 if response1 is preferred, 2 if response2 is preferred.
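Because both evaluators expose `choose(objective, prompt, response1, response2)`, results can be tallied the same way regardless of which one is used. The sketch below assumes that shared signature and the 1/2 return convention documented above; `LongerResponseEvaluator` is a hypothetical offline stand-in that simply prefers the longer response, so no API key or human input is needed.

```python
from collections import Counter


class LongerResponseEvaluator:
    """Hypothetical evaluator with the documented choose() signature."""

    def choose(self, objective, prompt, response1, response2):
        # Return 1 if response1 is preferred, 2 if response2 is preferred.
        return 1 if len(response1) >= len(response2) else 2


pairs = [
    ("Short.", "A much longer answer."),
    ("A detailed reply here.", "Ok."),
    ("Hi", "Hello there"),
]

evaluator = LongerResponseEvaluator()
tally = Counter(
    evaluator.choose("Answer helpfully.", "Greet the user.", r1, r2)
    for r1, r2 in pairs
)
print(f"response1 preferred: {tally[1]}, response2 preferred: {tally[2]}")
```

Swapping in a different evaluator with the same method signature would leave the tallying code unchanged.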