If a Vendor Won't Talk About F1, Walk Away

By Deepti Yenireddy, CEO, Boon AI

I sat through a vendor demo last quarter where the rep clicked through a takeoff in real time. Fixtures lit up on the screen, quantities populated a table, the whole thing took ninety seconds. It looked great. So I asked the only question that matters: “On that sheet, how many fixtures did it miss, and how many did it invent?”

The room went quiet. The rep said they didn’t have that number handy, but accuracy was “very high” and customers were “thrilled.” I asked again, more specifically: what’s your F1 on electrical takeoffs? He didn’t know what F1 was.

That is the whole story of construction AI right now, compressed into one meeting. The demos are polished. The accuracy claims are everywhere. And almost nobody will put a number behind them.

The claim you hear everywhere, and the number nobody shows

Walk any preconstruction trade show floor and count how many booths say “AI-powered.” Then count how many will tell you, on the record, how often their tool is wrong. The first number is most of them. The second number is close to zero.

This is not an accident. Accuracy is the hardest thing to build in construction AI and the easiest thing to fake in a demo. A controlled demo on a sheet the vendor has seen a hundred times will always look clean. Your bid set, full of regional symbol conventions and a legend on page three that governs a count on page eleven, is where tools fall apart. The demo is theater. The number is truth. Vendors who have the number lead with it. Vendors who don’t change the subject to “time savings.”

Time savings is a real benefit, but it’s a trap as a buying criterion. A tool that finishes a takeoff in two minutes and is wrong twenty percent of the time hasn’t saved you anything. It has moved the work from counting to checking, and checking a confident wrong answer is slower than doing it yourself, because now you have to figure out where it lied.

What F1 actually measures, in plain terms

F1 is a single score, between zero and one, that captures two ways an automated takeoff can fail you. You don’t need the math to use it. You need the two failure modes it combines.

The first failure is missing things. The tool looks at a sheet with one hundred receptacles and reports eighty. The twenty it dropped are real scope that won’t show up in your bid. That’s how you win a job and then lose money on it, because the quantities you priced were short. The technical term is recall — of everything that was really there, how much did the tool catch?

The second failure is inventing things. The tool reports one hundred and twenty receptacles when there are one hundred. The extra twenty are phantom scope. That’s how you price yourself out of a bid you should have won, padding the number with quantities that don’t exist. The technical term is precision — of everything the tool reported, how much was actually real?

Here’s why you can’t look at just one of them. A tool can score beautifully on recall by flagging everything that might be a fixture, including every smudge and shadow. It catches all the real ones because it flags nearly everything. But now it’s inventing constantly, and you’re back to checking every line. The opposite is just as bad: a tool can score beautifully on precision by only reporting the fixtures it’s absolutely certain about, while quietly dropping every ambiguous one. Everything it reports is real, but it’s missing half your scope.

F1 is the score that won’t let a vendor hide behind one and ignore the other. Formally it is the harmonic mean of precision and recall, so it gets dragged toward whichever of the two is worse. It only gets high when the tool is both catching what’s there and not making things up. That’s why it’s the number a serious vendor publishes and a theatrical one avoids. A single accuracy percentage can be gamed. F1 is much harder to fake, because it accounts for both ways the tool can cost you money.
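For readers who do want the math, here is a minimal sketch in Python using the receptacle examples above. It leans on assumptions the examples leave open: in the "misses 20" case, all eighty reported fixtures are taken to be real, and in the "invents 20" case, all one hundred real fixtures are taken to be caught. The last two scenarios use made-up counts to stand in for the flag-everything tool and the only-sure-things tool. The function name is illustrative, not any vendor's API.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw takeoff counts."""
    reported = true_positives + false_positives   # everything the tool counted
    actual = true_positives + false_negatives     # everything really on the sheet
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (real fixtures caught, fixtures invented, real fixtures missed)
scenarios = {
    "misses 20 of 100": (80, 0, 20),
    "invents 20 extra": (100, 20, 0),
    "flags everything": (100, 300, 0),         # hypothetical: 300 false hits
    "reports only sure things": (50, 0, 50),   # hypothetical: half the scope dropped
}

results = {name: precision_recall_f1(*counts) for name, counts in scenarios.items()}
for name, (precision, recall, f1) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Under those assumptions the flag-everything tool has perfect recall and an F1 of 0.40, and the only-sure-things tool has perfect precision and an F1 of about 0.67. Neither can hide behind the half of the score it games.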

Why “95% accurate” is a non-answer

When a vendor says “ninety-five percent accurate,” ask them: percent of what?

Percent of fixtures correctly classified, on which sheets, in which trades, measured against what ground truth? “Accuracy” with no denominator is a marketing number. It usually means the best result on the most cooperative sheet the vendor has, presented as if it were typical. It tells you nothing about your mechanical sheets, your electrical risers, or the messy as-built set you actually have to bid.

A real accuracy claim has shape. It names the trade, because a tool that’s strong on flooring can be weak on ductwork. It names the floor, not the ceiling, because the worst case is what bites you on deadline. And it’s measured on real plan sets, not demo decks. When a vendor gives you a single round number with none of that structure, they’re not lying exactly. They’re just not telling you anything you can underwrite a bid against.
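To see why the denominator matters, consider a hypothetical sheet where the misses and the inventions happen to cancel out. The sketch below is illustrative only. The counts are invented, and "count accuracy" is just one plausible reading of what an unqualified accuracy percentage might mean.

```python
# Hypothetical sheet: 100 real receptacles. The tool misses 20 of them and
# invents 20 that are not there, so its total still comes out to 100.
real_fixtures = 100
missed = 20      # real fixtures the tool did not report
invented = 20    # reported fixtures that do not exist

caught = real_fixtures - missed    # real fixtures the tool found
reported = caught + invented       # the total the tool shows in its table

# One possible "accuracy": how close the reported total is to the real total.
count_accuracy = 1 - abs(reported - real_fixtures) / real_fixtures

precision = caught / reported      # share of reported fixtures that are real
recall = caught / real_fixtures    # share of real fixtures that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"count accuracy: {count_accuracy:.0%}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

The total matches perfectly, so a vendor measuring that way could truthfully say "100% accurate," while forty individual calls on the sheet are wrong and the F1 is 0.80.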

The questions that separate measured tools from hopeful ones

You don’t have to be technical to run this test. Take these five questions into any construction AI evaluation, and watch how the vendor responds.

1. What’s your F1, broken out by trade? A serious vendor has different numbers for electrical, mechanical, flooring, and structural, because the work genuinely differs by trade. One blended number across everything is a tell that they’re averaging away their weak spots.

2. Is that a floor or a best case? Ask whether the number is the worst result you should expect or the best they’ve ever seen. The honest answer is a floor on real customer sheets. “Up to” is a best case dressed as a promise.

3. What was it measured against? Real plan sets from real projects, or a curated demo library? Ask to run the tool live on a sheet you bring, not one they’ve pre-loaded. The reaction to that request tells you everything.

4. Show me a failure. Ask the vendor to show you a case where the tool got it wrong and explain why. A team that measures itself honestly can do this in thirty seconds and isn’t embarrassed by it. A team that can’t has never looked.

5. How will I know where it’s wrong on my own work? When the takeoff comes back, how does the tool tell you what it’s unsure about? A measured tool flags its own low-confidence calls so you check the right things. A hopeful tool hands you a confident table and lets you discover the errors in the field.

If a vendor can answer those five cleanly, you’re talking to people who measure themselves. If the answers turn into “very high accuracy” and “customers love it,” you already have your answer, and it isn’t a number.
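Questions 1 and 2 can be made concrete with a small sketch. The per-sheet F1 scores below are invented for illustration and do not describe any real tool. The point is only how differently the same data reads when it is reported as one blended average, as an "up to" best case, or as a floor by trade.

```python
from statistics import mean

# Hypothetical per-sheet F1 scores grouped by trade. Illustrative numbers only.
sheet_f1 = {
    "electrical": [0.95, 0.93, 0.91],
    "flooring": [0.97, 0.96, 0.94],
    "mechanical": [0.88, 0.76, 0.61],
}

# The single number a vendor might quote: everything averaged together.
all_scores = [score for scores in sheet_f1.values() for score in scores]
blended = mean(all_scores)
best_case = max(all_scores)   # the "up to" number

# The breakdown a buyer should ask for: worst, typical, and best per trade.
report = {
    trade: {"floor": min(scores), "average": mean(scores), "best": max(scores)}
    for trade, scores in sheet_f1.items()
}

print(f"blended F1: {blended:.2f}   best case ('up to'): {best_case:.2f}")
for trade, stats in report.items():
    print(f"{trade}: floor={stats['floor']:.2f} "
          f"average={stats['average']:.2f} best={stats['best']:.2f}")
```

With these made-up numbers the blended score is about 0.88 and the best case is 0.97, while the mechanical floor is 0.61. That gap is the weak spot a single blended number averages away.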

Why we published ours

We measured our own takeoffs the hard way, against real customer plan sets, trade by trade, and we published the F1 floors: electrical at or above 0.90, architectural at or above 0.85, mechanical at or above 0.74. Those aren’t best cases. They’re the numbers we’re willing to be held to on work we haven’t seen before.

We did it for a simple reason. The fastest way to earn an estimator’s trust is to show them the number they’re afraid to ask for, before they ask. The mechanical floor is lower than the others, and we publish it anyway, because hiding it would tell you we were the kind of vendor this whole piece is warning you about. A floor you can verify is worth more than a ceiling you have to take on faith.

The industry is going to keep getting louder about AI. More booths, more demos, more confident tables populating in ninety seconds. The noise isn’t the signal. The number is. Ask for the F1. Ask for it by trade. Ask whether it’s a floor. And if the vendor won’t talk about it, you’ve learned the most important thing about their tool already.

Walk the floor. Drop a real plan set into the tool that will show you its number. See it on your own work tonight.