Anish Athalye

Semlib: LLM-powered Data Processing

Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs).

[Semlib demo]

No, this is not pseudocode. Yes, it runs.

Semlib provides, as building blocks, familiar functional programming primitives like map, reduce, sort, and filter, but with a twist: Semlib’s implementations of these operations are programmed with natural language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency, caching, and cost tracking.
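To make the idea concrete, here is a hypothetical sketch (not Semlib’s actual API) of what “programming a primitive with a natural language description” means, with the model call stubbed out by a plain function:

```python
import asyncio


async def llm(prompt: str) -> str:
    # Stub standing in for a real model call; a real implementation
    # would send `prompt` to an LLM API and return its answer.
    return "yes" if "positive" in prompt and "love" in prompt else "no"


async def semantic_filter(items: list[str], criterion: str) -> list[str]:
    # Ask the "model" a yes/no question about each item, concurrently.
    prompts = [f"Is this {criterion}? Answer yes or no: {item}" for item in items]
    answers = await asyncio.gather(*(llm(p) for p in prompts))
    return [item for item, ans in zip(items, answers) if ans == "yes"]


reviews = ["I love this product", "Terrible experience"]
positive = asyncio.run(semantic_filter(reviews, "positive"))
```

The filter is specified by the English word “positive”, not by code; everything else (prompt construction, concurrency, parsing the yes/no answer) is plumbing of the kind Semlib handles for you.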

Origin story

I started working on Semlib while trying to solve a concrete problem: processing performance reviews for my team. Like many engineering managers, I was spending hours manually reading through individual self-evaluations and peer feedback, trying to synthesize everything into coherent, actionable feedback for each person. I was happy to do this task manually, but I was curious to see if I could automate it with AI instead.

My first attempt was the obvious approach: dump everything into Claude Desktop and ask it to handle the entire task. But the results were disappointing: the model confused self-evaluations with peer feedback, mixed up feedback between individuals, and missed important details.

Then I tried breaking the problem down: first, extract peer feedback for each person from every review, then summarize the extracted feedback for each individual, with a total of n² LLM queries. This two-step approach worked dramatically better. The structured computation produced higher-quality results than the single-shot approach, even though I was using the same underlying models. This is a form of context engineering.

The lesson was clear: while LLMs are powerful, dumping everything into a long-context model and hoping for the best doesn’t work. Even with today’s reasoning models and agents, that approach often falls short for complex data processing tasks. You need to impose additional structure.

Design

Semlib provides a framework for imposing that necessary structure on complex data processing tasks. Semlib gives programmers a familiar interface—semantic versions of operators like map, reduce, and sort from functional programming languages and data processing systems such as Spark—while implementing nontrivial semantic algorithms like LLM-based sorting and handling LLM orchestration under the hood.

Semlib lets you describe the “what” while it takes care of the “how”, including prompting, parsing, concurrency, caching, and cost tracking. You focus on the logic of your data processing pipeline, not the implementation details of working with LLMs.

This approach has a number of benefits.

Quality. By breaking down a sophisticated data processing task into simpler steps that are solved by today’s LLMs, you can get higher-quality results, even in situations where today’s LLMs could process the data in a single shot but would produce barely acceptable results. (example: analyzing support tickets in Airline Support Report)

Feasibility. Even long-context LLMs have limitations (e.g., 1M tokens in today’s frontier models). Furthermore, performance often drops off with longer inputs. By breaking down the data processing task into smaller steps, you can handle arbitrary-sized data. (example: sorting an arbitrary number of arXiv papers in arXiv Paper Recommendations)

Latency. By breaking down the computation into smaller pieces and structuring it using functional programming primitives like map and reduce, the parts of the computation can be run concurrently, reducing the latency of the overall computation. (example: tree reduce with O(log n) computation depth in Disneyland Reviews Synthesis)
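To illustrate why a tree-shaped reduction helps (this is a generic asyncio sketch, not Semlib’s API), pairwise combining halves the number of items each round, so with enough concurrency the dependency depth is O(log n) rather than O(n):

```python
import asyncio


async def combine(a: str, b: str) -> str:
    # Stand-in for an LLM call that merges two partial summaries;
    # here it just concatenates, so the result shows the tree shape.
    await asyncio.sleep(0)  # simulate an async API call
    return f"({a}+{b})"


async def tree_reduce(items: list[str]) -> str:
    # Combine adjacent pairs concurrently; each round halves the list,
    # so the number of rounds (the latency-critical depth) is O(log n).
    while len(items) > 1:
        pairs = list(zip(items[::2], items[1::2]))
        combined = await asyncio.gather(*(combine(a, b) for a, b in pairs))
        # Carry over the last element when the count is odd.
        items = list(combined) + ([items[-1]] if len(items) % 2 else [])
    return items[0]


result = asyncio.run(tree_reduce(["a", "b", "c", "d"]))
```

With four items, the two leaf-level combines run concurrently in round one and a single combine finishes in round two: two rounds of latency instead of three for a sequential left fold.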

Cost. By breaking down the computation into simpler sub-tasks, you can use smaller and cheaper models that are capable of solving those sub-tasks, which can reduce data processing costs. Furthermore, you can choose the model on a per-subtask basis, allowing you to further optimize costs. (example: using gpt-4.1-nano for the pre-filtering step in arXiv Paper Recommendations)

Security. By breaking down the computation into tasks that simpler models can handle, you can use open models that you host yourself, allowing you to process sensitive data without having to trust a third party. (example: using gpt-oss and qwen3 in Resume Filtering)

Flexibility. LLMs are great at certain tasks, like natural-language processing. They’re not so great at other tasks, like multiplying numbers. Using Semlib, you can break down your data processing task into multiple steps, some of which use LLMs and others that just use regular old Python code, getting the best of both worlds. (example: Python code for filtering in Resume Filtering)
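A hedged sketch of this interleaving (illustrative names and data, not Semlib API): plain Python handles the exact, arithmetic-style check, and only the language-heavy step, stubbed out below, would go to a model:

```python
import re

# Toy data for illustration only.
resumes = [
    "Alice: 7 years of experience, built distributed systems",
    "Bob: 2 years of experience, wrote internal tools",
]


def years_of_experience(resume: str) -> int:
    # Deterministic extraction: regular code, no LLM needed.
    match = re.search(r"(\d+) years", resume)
    return int(match.group(1)) if match else 0


# Step 1: plain Python filter (cheap, exact, no hallucinations).
senior = [r for r in resumes if years_of_experience(r) >= 5]

# Step 2: a semantic step would go here, e.g. summarizing each remaining
# resume with an LLM; stubbed as a plain function for this sketch.
names = [r.split(":")[0] for r in senior]
```

The point is that each step uses whichever tool is reliable for it; the LLM only sees the items that survive the exact filter.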

Case study: automating performance reviews with Semlib

We just did performance reviews on my team. Everybody filled out a Notion doc based on these instructions. There ended up being more variation than I would have liked in how people formatted their reviews: for example, some created a big M×N table, some used section headings, some followed the original bulleted list format, and some mixed together feedback for different people in the same section, complicating the task of processing the reviews.

As part of the process, I needed to go through all the docs and synthesize feedback for each individual, a task which included deduplicating and summarizing feedback, modifying certain feedback to be suitable to be shared directly, and removing unimportant feedback.

To evaluate Semlib, I first did this task manually, then I wrote a program using Semlib to automate this part of the performance review process, and finally I compared the results.

Performance review processing pipeline

I saved each individual’s doc to reviews/{name}.md, and loaded the data in my Jupyter notebook.

(click to see code)
import os

people = [
    i.split(".md")[0] for i in os.listdir("reviews") if i.endswith(".md")
]

reviews = {person: open(f"reviews/{person}.md").read() for person in people}

As the first step, I extracted feedback for each pair of (reviewer, target) by mapping a template over all pairs (a total of n × (n − 1) LLM queries).

(click to see code)
def extract_peer_review(reviewer: str, target: str) -> str:
    return f"""
You are an engineering manager and an expert at analyzing performance reviews of software engineers.

Your task is to read the performance review and extract peer feedback, if any, for {target}.

If there is no peer feedback for {target}, return "(no feedback)".

If there is feedback, separate it into three section headings, using the following template:

<template>
# Peer feedback for {target} from {reviewer}

## Contributions

...

## Strengths

...

## Areas for growth

...

## Other

...
</template>

If any section is empty, put "(no feedback)" under that section.

Here is the performance review written by {reviewer}, that includes self-evaluation and peer feedback:

---

{reviews[reviewer]}
""".strip()

from semlib import Session, OnDiskCache

session = Session(cache=OnDiskCache("review_cache.db"), model="gpt-4o")

pairs = [
    (reviewer, target) for target in people for reviewer in people if reviewer != target
]

feedback_list = await session.map(pairs, lambda pair: extract_peer_review(*pair))

peer_feedbacks = {target: [] for target in people}

for (reviewer, target), feedback in zip(pairs, feedback_list):
    peer_feedbacks[target].append(feedback)

As the second step, I summarized feedback for each individual with another map (a total of n more LLM queries).

(click to see code)
def summarize_peer_feedback(target: str, feedbacks: list[str]) -> str:
    # Join outside the f-string: backslashes inside f-string expressions
    # are only supported as of Python 3.12.
    joined_feedback = "\n\n".join(feedbacks)
    return f"""
You are an engineering manager and an expert at analyzing performance reviews of software engineers.

Your task is to read the following peer feedback for {target}, and summarize it in a format suitable to share directly with {target}. Make sure you:

- Use this template:

<template>
# Summary of peer feedback for {target}

## Contributions

...

## Strengths

...

## Areas for growth

...

## Other

...
</template>

- Each section should contain a bulleted list of points.
- Anonymize the feedback, removing any references to the original reviewers.
- Combine similar pieces of feedback, avoiding duplicates.
- Keep the review as concise as possible, while retaining specificity and precision.
- Include at most three bullet points per section, focusing on the most important feedback.
- Write this as feedback directed to {target}, using "you" statements.
- Output raw markdown, omitting any backticks or the <template> tags.

Here is the peer feedback for {target}:

{joined_feedback}
""".strip()

summarized_feedback_list = await session.map(
    people,
    lambda target: summarize_peer_feedback(target, peer_feedbacks[target]),
)

summarized_feedback = dict(zip(people, summarized_feedback_list))

Results

Long story short, the AI-generated results were better than mine; I ended up sharing both versions with individuals on the team. For obvious reasons, I can’t share the actual input/output data from this case study; for full runnable examples, see the examples in the Semlib docs.

Comparison with Claude Desktop

I tried giving this prompt to Claude Desktop, along with all the performance review .md files, and asked Claude Sonnet 4 with extended thinking to complete the task.

Overall, the output looked pretty good, but it had certain critical failures that made it unusable. Failure modes included mixing up self-evaluation with peer feedback and mixing up feedback between individuals.

This result isn’t too surprising: it’s a nontrivial task, with a lot of context.

Related work

DocETL (Shankar et al., 2024, later published in VLDB) from Berkeley’s EPIC Data Lab is a system that optimizes semantic document processing pipelines written in a declarative YAML-based DSL. Key research contributions include agent-based rewriting and optimization of the pipeline for quality/cost/feasibility. Follow-up work in DocWrangler (Shankar et al., 2025) contributes a no-code/low-code IDE built on top of DocETL.

Semlib has a Python library interface rather than a YAML-based DSL, allowing for interactive use in a Jupyter notebook and pipelines that work with arbitrary Python objects and intermix traditional algorithms with semantic operations.

LOTUS (Patel et al., 2024, later published in VLDB) from Stanford and Berkeley is a system that optimizes semantic queries over dataframes. Key research contributions include an extension of relational algebra with a formalization of semantic operators and optimizations for semantic operators with statistical accuracy guarantees. The implementation is a Python library that extends Pandas DataFrames with semantic operators.

Semlib goes beyond relational operators, including functionality such as the reduce operator which enables processing arbitrary amounts of data.

Palimpzest (Liu et al., 2024, later published in CIDR) from MIT’s Data Systems Group is a system that optimizes analytics queries that include semantic operators. Key research contributions include the design of an optimizer that supports queries that intermix semantic operators and traditional relational operators. The implementation is a Python library that includes an embedded DSL for writing analytics queries, which are analyzed and optimized by Palimpzest’s program optimizer.

Semlib supports many more semantic operators, such as sort and reduce, and its design as a Python library exposing individual semantic operators as methods allows for easily inspecting intermediate outputs of every step of a data processing pipeline.

Semlib is significantly simpler than these research systems—Semlib is optimized for ease of use in real-world data processing use cases. Semlib is a practical Python library with an easy-to-use API, extensive documentation, detailed examples, careful type annotations, and 100% test coverage.

Conclusion

Semlib is a practical Python library that helps solve a class of semantic data processing problems that is underserved by current tools such as agents and conversational chatbots. As language models improve, agentic tools will be able to handle increasingly sophisticated tasks. But until we achieve AGI and it’s no longer necessary for humans to break down complex tasks for AIs, there will be a place for libraries like Semlib that help programmers structure and break down data processing problems into tractable sub-tasks for LLMs.