Assistant conversation

try playing with the new spreadsheet querying feature, it’s duckdb based and should be pretty cool

Listing spreadsheets 3 spreadsheets 18 sheets
gold_standard_v12.xlsx

bid_level 3700 rows x 1 columns
tender_level 1278 rows x 17 columns
company_id_map 1038 rows x 7 columns
population 913 rows x 34 columns
IUB_metadata 80 rows x 27 columns
special_cases_guide 7 rows x 2 columns
VARIABLE_dictionary_v2 5 rows x 4 columns
variable_dictionary 41 rows x 7 columns
Primary_sources 17 rows x 2 columns
normalization_rules 10 rows x 5 columns
inventory.xlsx

Sheet1 17098 rows x 17 columns
inventory_v3.xlsx

tender_level (2) 29 rows x 15 columns
Inventory 330 rows x 16 columns
IUB 28 rows x 54 columns
bid_level 151 rows x 11 columns
tender_level 28 rows x 8 columns
Pivot1 8 rows x 2 columns
CPV codes 44 rows x 2 columns

Quick profile of available procurement spreadsheet tables and missingness-relevant fields 6 rows of 6

WITH sheets AS (
  SELECT 'gold.population' AS sheet, count(*) AS rows FROM xlsx_gold_standard_v12_b56445cb_4
  UNION ALL SELECT 'gold.tender_level', count(*) FROM xlsx_gold_standard_v12_b56445cb_2
  UNION ALL SELECT 'gold.bid_level', count(*) FROM xlsx_gold_standard_v12_b56445cb_1
  UNION ALL SELECT 'inventory_v3.tender_level', count(*) FROM xlsx_inventory_v3_4b022d33_5
  UNION ALL SELECT 'inventory_v3.bid_level', count(*) FROM xlsx_inventory_v3_4b022d33_4
  UNION ALL SELECT 'inventory.files', count(*) FROM xlsx_inventory_830a48f1_1
)
SELECT * FROM sheets ORDER BY rows DESC;

sheet	rows
inventory.files	17098
gold.bid_level	3700
gold.tender_level	1278
gold.population	913
inventory_v3.bid_level	151
inventory_v3.tender_level	28

Open saved result #SPELA2

Explore distribution of gold-standard population by year, KIBS class, gold-standard flag, and document availability 16 rows of 16

SELECT
  source_year,
  kibs_text_class,
  gold_standard,
  count(*) AS tenders,
  sum(CASE WHEN lower(coalesce(has_report_bool,'')) IN ('true','1','yes','jā','ja') THEN 1 ELSE 0 END) AS has_report,
  round(avg(try_cast(total_docs AS DOUBLE)), 2) AS avg_docs,
  sum(CASE WHEN lower(coalesce(meat_rich,'')) IN ('true','1','yes','jā','ja') THEN 1 ELSE 0 END) AS meat_rich_n,
  sum(CASE WHEN lower(coalesce(likely_framework,'')) IN ('true','1','yes','jā','ja') THEN 1 ELSE 0 END) AS likely_framework_n
FROM xlsx_gold_standard_v12_b56445cb_4
GROUP BY 1,2,3
ORDER BY try_cast(source_year AS INT), kibs_text_class, gold_standard;

source_year	kibs_text_class	gold_standard	tenders	has_report	avg_docs	meat_rich_n	likely_framework_n
2024	non_kibs	0	255	227	15.37	73	0
2024	non_kibs	1	23	23	15.09	9	1
2024	possible_kibs	0	45	40	21.33	22	0
2024	possible_kibs	1	8	8	16.75	3	0
2024	strict_kibs	0	141	120	16.41	51	0
2024	strict_kibs	1	21	19	19.43	13	1
2024	weak_signal	0	87	80	16.52	46	0
2024	weak_signal	1	7	6	22.57	5	0
2025	non_kibs	0	115	95	14.83	38	1
2025	non_kibs	1	5	5	17.6	1	1
2025	possible_kibs	0	38	32	17.18	19	0
2025	possible_kibs	1	1	1	10	1	0
2025	strict_kibs	0	74	58	15.04	29	0
2025	strict_kibs	1	11	11	19.09	6	1
2025	weak_signal	0	78	66	13.86	31	1
2025	weak_signal	1	4	4	12	0	0

Open saved result #8GEEAD

Summarise hand-coded 28-tender sample by preparation time, bids, award value, and single-bidder rate 2 rows of 2

SELECT
  procedure_type,
  count(*) AS tenders,
  round(avg(try_cast(days_to_prepare AS DOUBLE)), 1) AS avg_days_to_prepare,
  median(try_cast(days_to_prepare AS DOUBLE)) AS median_days_to_prepare,
  round(avg(try_cast(total_bids AS DOUBLE)), 2) AS avg_total_bids,
  median(try_cast(total_bids AS DOUBLE)) AS median_total_bids,
  sum(CASE WHEN try_cast(total_bids AS INT)=1 THEN 1 ELSE 0 END) AS single_bid_tenders,
  round(avg(try_cast(award_amount AS DOUBLE)), 0) AS avg_award_amount
FROM xlsx_inventory_v3_4b022d33_5
GROUP BY 1
ORDER BY tenders DESC;

procedure_type	tenders	avg_days_to_prepare	median_days_to_prepare	avg_total_bids	median_total_bids	single_bid_tenders	avg_award_amount
Atklāts konkurss	14	22.5	19	8.21	3.5	2	28873672
Mazie iepirkumi	14	16.6	15	2.57	2	4	27770

Open saved result #3RULBU

Check whether shorter preparation periods in the hand-coded sample are associated with fewer bids 3 rows of 3

WITH t AS (
  SELECT
    tender_id,
    procedure_type,
    try_cast(days_to_prepare AS DOUBLE) AS days_to_prepare,
    try_cast(total_bids AS DOUBLE) AS total_bids,
    try_cast(award_amount AS DOUBLE) AS award_amount,
    general_agreement
  FROM xlsx_inventory_v3_4b022d33_5
  WHERE try_cast(days_to_prepare AS DOUBLE) IS NOT NULL
    AND try_cast(total_bids AS DOUBLE) IS NOT NULL
)
SELECT
  CASE
    WHEN days_to_prepare < 10 THEN '<10 days'
    WHEN days_to_prepare < 20 THEN '10-19 days'
    WHEN days_to_prepare < 30 THEN '20-29 days'
    ELSE '30+ days'
  END AS prep_window,
  count(*) AS tenders,
  round(avg(total_bids), 2) AS avg_bids,
  median(total_bids) AS median_bids,
  sum(CASE WHEN total_bids=1 THEN 1 ELSE 0 END) AS single_bidder,
  string_agg(tender_id, ', ' ORDER BY tender_id) AS tender_ids
FROM t
GROUP BY 1
ORDER BY min(days_to_prepare);

prep_window	tenders	avg_bids	median_bids	single_bidder	tender_ids
10-19 days	19	2.79	2	5	143415, 144471, 144476, 144555, 146863, 151019, 151080, 151178, 153192, 153292, 153433, 153570, 153631, 154274, 155433, 156048, 157384, 157387, 157474
20-29 days	6	14.67	2.5	1	144497, 148738, 150902, 152877, 155963, 156355
30+ days	3	3.33	3	0	133703, 144117, 148991

Open saved result #NL8Z4X

Test cross-sheet join: attach document inventory counts to the 28 tender-level sample 10 rows of 28

WITH docs AS (
  SELECT
    procurement_id AS tender_id,
    count(*) AS n_files,
    count(DISTINCT lower(file_type)) AS n_file_types,
    sum(try_cast(bytes AS BIGINT)) AS total_bytes,
    string_agg(DISTINCT lower(file_type), ', ' ORDER BY lower(file_type)) AS file_types
  FROM xlsx_inventory_v3_4b022d33_2
  GROUP BY procurement_id
), tenders AS (
  SELECT
    tender_id,
    procedure_type,
    try_cast(days_to_prepare AS DOUBLE) AS days_to_prepare,
    try_cast(total_bids AS DOUBLE) AS total_bids,
    try_cast(award_amount AS DOUBLE) AS award_amount
  FROM xlsx_inventory_v3_4b022d33_5
)
SELECT
  t.tender_id,
  t.procedure_type,
  t.days_to_prepare,
  t.total_bids,
  d.n_files,
  d.n_file_types,
  round(d.total_bytes / 1000000.0, 2) AS mb_files,
  d.file_types
FROM tenders t
LEFT JOIN docs d USING (tender_id)
ORDER BY d.n_files DESC NULLS LAST, t.total_bids ASC;

tender_id	procedure_type	days_to_prepare	total_bids	n_files	n_file_types	mb_files	file_types
152877	Atklāts konkurss	21	6	22	3	2.37	doc, docx, pdf
146863	Mazie iepirkumi	19	4	20	6	0.65	doc, docx, edoc, id146863_mimetype, pdf, xlsx
153433	Atklāts konkurss	14	5	19	5	2.24	doc, docx, edoc, id153433_mimetype, pdf
153631	Mazie iepirkumi	17	3	18	2	0.81	docx, pdf
144471	Mazie iepirkumi	15	2	17	4	0.8	docx, edoc, id144471_mimetype, pdf
151019	Mazie iepirkumi	15	2	16	2	1.47	docx, pdf
144555	Atklāts konkurss	16	4	16	5	0.75	doc, docx, edoc, id144555_mimetype, pdf
148738	Atklāts konkurss	29	74	15	4	3.02	docx, pdf, xlsx, zip
144117	Atklāts konkurss	41	5	14	5	0.95	docx, edoc, id144117_mimetype, pdf, xlsx
157387	Atklāts konkurss	15	2	12	3	4.12	docx, pdf, xlsx

Open saved result #V53U6W

Played with it — it’s genuinely useful. I found 3 workbooks / 18 sheets, including:

gold_standard_v12.xlsx: population, tender-level, company map, variable dictionary
inventory.xlsx: large file inventory, 17,098 rows
inventory_v3.xlsx: hand-coded tender/bid sample, IUB metadata, document inventory

A few quick DuckDB tests worked well:

1. Table size profiling

Saved result:

Largest available sheets:

sheet	rows
inventory.files	17,098
gold.bid_level	3,700
gold.tender_level	1,278
gold.population	913
inventory_v3.bid_level	151
inventory_v3.tender_level	28

2. Gold-standard population by KIBS class/year

Saved result:

Useful sanity check: the gold.population sheet has 913 tenders, split across 2024–2025 and KIBS categories. Example:

2024 strict KIBS: 162 total, of which 21 are gold-standard
2025 strict KIBS: 85 total, of which 11 are gold-standard
meat_rich and likely_framework are queryable as participation-relevant tender-design flags

This is useful for checking whether the gold-standard sample is over/under-representing strict KIBS, MEAT-rich tenders, frameworks, or document-heavy cases.

3. Hand-coded 28-tender sample summary

Saved result:

procedure type	tenders	avg days to prepare	avg total bids	single bid tenders
Atklāts konkurss	14	22.5	8.21	2
Mazie iepirkumi	14	16.6	2.57	4

Nice immediately thesis-relevant signal: small procurements have shorter prep windows and fewer bids in this small sample.

4. Prep time vs bids

Saved result:

prep window	tenders	avg bids	median bids	single bidder
10–19 days	19	2.79	2	5
20–29 days	6	14.67	2.5	1
30+ days	3	3.33	3	0

The 20–29 day average is inflated by one big outlier: tender 148738 with 74 bids. Median is more stable.

5. Cross-sheet join: tender sample + document inventory

Saved result:

DuckDB joins work nicely. I joined the 28-tender sample to the document inventory and got document-burden indicators:

tender id	procedure	days	bids	n files	MB files	file types
152877	Atklāts konkurss	21	6	22	2.37	doc docx pdf
146863	Mazie iepirkumi	19	4	20	0.65	doc docx edoc pdf xlsx…
153433	Atklāts konkurss	14	5	19	2.24	doc docx edoc pdf…
148738	Atklāts konkurss	29	74	15	3.02	docx pdf xlsx zip

This is probably the coolest thesis use-case: build operational variables like:

n_files
n_file_types
total_MB
has_xlsx
has_zip
has_edoc
days_to_prepare
total_bids
single_bidder

Then test whether document complexity / bid preparation burden is associated with lower participation.

Main practical note: spreadsheet columns are imported as text, so numeric analysis needs try_cast(...), but otherwise the DuckDB workflow is smooth.