Full Width [alt+shift+f] Shortcuts [alt+shift+k]
Sign Up [alt+shift+s] Log In [alt+shift+l]
19
Intro Hi there! In this post, I want to show off a fun little web app I made for visualizing parking tickets in Chicago, but because I've spent so much time on the overall project, I figured I'd share the story that got me to this point. In many ways, this work is the foundation for my interest in public records and transparency, so it has a very special place in my heart. If you're here just to play around with a cool app, click here. Otherwise, I hope you enjoy this post and find it interesting. Also - please don't hesitate to share harsh criticism or suggestions. Enjoy! Project Beginnings Back in 2014, while on vacation, my car was towed for allegedly being in a construction zone. The cost to get the car out was quoted at around $700 through tow fees and storage fees that increase each day by $35. That's obviously a lot of money, so the night I noticed the car was towed, I began looking for ways to get the ticket thrown out in court. After some digging, I found a dataset on...
over a year ago

Improve your reading experience

Logged in users get linked directly to articles resulting in a better reading experience. Please login for free, it takes less than 1 minute.

More from Civic Hax

That Time the City of Seattle Accidentally Gave Me 32m Emails for 40 Dollars

Background In my last post, I wrote about my adventure of requesting metadata for both phone calls and emails from the City of Chicago Office of the Mayor. The work there - and its associated frustration - sent me down a path of sending requests throughout the US to both learn whether these sorts of problems are systemic (megaspoiler: they are) and to also start mapping communication across the United States. Since then, I’ve submitted over a hundred requests for email metadata across the United States – at least two per state. The first large batch of requests for email metadata were sent to the largest cities of fourteen arbitrary states in a trial run of sorts. In the end of that batch, only two cities were willing to continue with the request - Houston and Seattle. Houston complied surprisingly quickly and snail mailed the metadata for 6m emails. Seattle on the other hand... The Request On April 2, 2017 I sent this fairly boilerplate request to Seattle's IT department: For all emails sent to/from any Seattle owned email address in 2017, please provide the following information: 1. From address 2. To address 3. bcc addresses 4. cc addresses 5. Time 6. Date Technically this request can done with a single line powershell command. At a policy level, though, it usually gets a lot of pushback. Seattle's first response included a bit of gobsmackery that I’ve almost become used to: Based on my preliminary research, there have been 5.5 million emails sent and 26.8 million emails received by seattle.gov email addresses in the past 90 days. This is a significant amount of records that will need to be reviewed prior to sharing them with you. Do you have a more targeted list of email addresses you might be interested in? If not, I will work to find out how long review will take and will be in touch. Of course, I still want the records, so - I would like to stick with this request as-is for all ~32m emails.  Since this request is for metadata only, the amount of review needed should be relatively small. Fee Estimate of 33 Million Dollars A week later, I received this glorious response. Each paragraph is interesting in itself, so let’s break most of it down piece by piece. Rewording of request This acknowledges receipt of your public disclosure request C012032-041017 received on April 03, 2017 regarding: All emails sent to/from any Seattle owned email address in 2017 including metadata:1. From address2. To address3. bcc addresses4. cc addresses5. Time6. Date Notice the change of language from the original wording. Their rewording completely changes the scope of the request so that it's not just for metadata, but also the emails contents. No idea why they did that. Salary Fees With that being said, Seattle IT estimates spending 30 seconds to two minutes to review each email. It has been estimated that this work will take approximately 320 years of staff time at an expense of $33 million in salary. Wot. Normally, a flustered public records officer would just reject a giant request for being for “unduly burdensome”… but this sort of estimate is practically unheard of. So much so that other FOIA nerds have told me that this is the second biggest request they've ever seen. The passive aggression is thick. Needless to say, it's not something I'm willing to pay for! The estimate of 30s per metadata entry is also a bit suspect. Especially with the use of Excel, which would be useful for removing duplicates, etc. Storage Fees We estimate that this request contains 8-10 terabytes of information, for which we could need to stand up an FTP server through which the requester will be able to download cleared email meta data. As allowed under RCW 42.56.120, we would charge the requester for the actual copying costs of fulfilling this request. Based on the Seattle IT cost model used for internal City charge backs, the anticipated cost to the requester is $2,480 per year plus $2.11 per gigabyte of storage. We are still working on the storage requirements for this effort. If we assume 10 TB of storage, this would require $21,606.40/year in requester fees. Heh. Any sysadmin can tell you that The costs of storage doesn’t exactly come from the storage medium itself; administration costs, supporting hardware, etc, are the bulk of the costs. But come on, let’s be realistic here. There’s very little room for good faith in their cost estimates – especially since the last time a single gigabyte cost that much was between 2002 and 2004. That said – some other interesting things going on here: Their file size estimation is huge. For comparison, that Houston’s email metadata dump was only 1.2GB. The fact that they mention “meta data” [sic] implies that they did acknowledge that the request was for metadata. Seattle already uses Amazon S3 to store public records requests’ data. At the time, S3 was charging $.023/GB Continue anyway? At this time, the City anticipates that it will be able to provide a first installment of records on or about May 29, 2017. However, please note that this time estimate may change depending on the clarification you provide and as we continue to process your request.  If the City does not hear from you within 30 days, the City will consider your request closed. Oddly, they don't actually close out the request and instead ask whether I wanted to continue or not. I responded to their amazing email by asking how many records I'd receive on May 29th, but never received an answer back. Reversal of the Original Cost Estimate On June 5, they sent a new response admitting that their initial fee estimation was wrong, and asked for $1.25 for two days’ (out of three months) of records: At this point I can send you an exel spreadsheet with the data points you are requesting.  The cost for the first installment is $1.25 for emails sent or received on January 1 and 2, 2017.  Because the spreadsheet does not contain the body of the email and just the metadata that you requested, no review will be necessary, and we'll be able to get this information to you at a faster pace than the 320 years quoted you earlier.  Because they're asking for a single check for $1.25 for just two days’ worth of metadata – and wouldn’t send anything until that first check came in, my interpretation is that they’re taking a page from /r/maliciouscompliance and just making this request as painful as possible just for the simple sake of making it difficult. So in response, I preemptively sent them fourteen separate checks. The first thirteen checks were all around ~$1.25. That seemed to work, since they never asked for a single payment afterwards. From there, my inbox went mostly silent for two months, and I mostly forgot about the request, though they eventually cashed all of the checks and made me an account for their public records portal. SNAFU Fast forward to August 22, when I randomly added that email account back to my phone. Unexpectedly, it turned out they actually finished the request! And without a bill for millions of dollars! Sure enough, their public records request portal had about 400 files available to download, which all in all contained metadata for about 32 million emails. Neat! Problem though... they accidentally included the first 256 characters of all 32 million emails. Here are some things I found in the emails: Usernames and passwords. Credit card numbers. Social security numbers and drivers licenses. Ongoing police investigations and arrest reports. Texts of cheating husbands to their lovers. FBI Investigations. Zabbix alerts. In other words... they just leaked to me a massive dataset filled with intimately private information. In the process, they very likely broke many laws, including the Privacy Act of 1974 and many of WA's own public records laws. Frankly, I'm still at a loss of words. It’s hard to say how any of this happened exactly, but odds are that a combination of request’s rewording and the original public records officer going on vacation led to a communication breakdown. I don’t want to dwell on the mistake itself, so I’ll stop it at that. Side note to Seattle's IT department - clean up your disks. You shouldn't have that many disks at 100%! Raising the Issue I responded as passively as possible in the hopes that they’d catch their mistake on their own: The responsive records are not consistent with my request and includes much more info than I initially requested. Could you please revisit this request and provide the records responsive to my initial request? Their response: The information that you requestedis located in columns: From address = column J To address = column K bcc address = column M cc address = column L Time and date = column R  of the reports.  The records were generatedfrom a system report and I am unable to limit the report to generate only thefields you requested.  The City has no duty to create a record that doesnot exist.  As such, we have provided all records responsive to yourrequest and consider your request closed. Disregarding the fact that they used a very common tactic of denying information on the basis that its disclosure would require the creation of new records… they didn’t get the point. I explained what information they leaked, and made it very clear how I was going to escalate this: Please address this matter as if it was a large data breach. For now, I will be raising this matter to the WA Office of Privacy and Data Protection. None of the files provided to me have been shared with anyone else, nor do I have any future intention of sharing. Their response: Thank you for your email and bringing this inadvertent error to our attention so quickly. We have temporarily suspended access to GovQA while we look into the cause of this issue. We are also working on reprocessing your request and anticipate providing you with corrected copies of the records you requested through GovQA next week. In the meantime, please do not review, share, copy or otherwise use these records for any purpose. We are sorry for any inconvenience. Phone Call Not too long after that, after contacting some folks on Seattle’s Open Data Slack, I found my way onto a conference phone call with both Seattle’s Chief Technology officer and their Chief Privacy Officer and we discussed what happened, and what should happen with the records. They thanked me for bringing the situation to their attention and all that, but the mood of the call was as if both parties had a knife behind their back. Somewhere towards the end of the call, I asked them if it was okay to keep the emails. Why not at least ask, right? Funny enough, in the middle of that question, my internet died and interrupted the call for the first time in the six months I lived in that house. Odd. It came back ten minutes later, and I dialed back into the conference line, but the mood of the call pretty much 180’d. They told me: All files were to be deleted. Seattle would hire Kroll to scan my hard drives to prove deletion. Agreeing to #1 and #2 would give me full legal indemnification. This isn't something I'm even remotely cool with, so we ended the call a couple minutes later, and agreed to have our lawyers speak going forward. Deleting the Files After that call, I asked my lawyer to reach out to their lawyer and was pretty much told that Seattle was approaching the problem as if they were pursuing Computer Fraud And Abuse (CFAA) charges. For information that they sent. Jiminey Cricket.. So, I deleted the files. Most of what happened next over a month or so was mostly between their lawyer and mine, so there’s not really that much for me to say. Early on I suggested that I write an affidavit that explains what happened, how I deleted the files, and I validated that the files were deleted. They mostly agreed, but still wanted to throw some silly assurance things my way – including asking me to run a bash script to overwrite any unused disk space with random bits. I eventually ran zerofree and fstrim instead, and they accepted the affidavit. No more legal threats from there. Seattle’s Reaction About a week after the phone call, a Seattle city employee contacted Seattle’s KIRO7 about the incident. In KIRO7's investigation, they learned that, Seattle hadn’t sent any disclosure of the leak - something required by WA’s public records request law. Only after their investigation did Seattle actually notify its employees about the emails leak. Link to their story (video inside). A week later, another article was published by Seattle’s Crosscut which goes into a lot of detail, including some history of Seattle's IT department. This line towards the bottom still makes me laugh a little: The buffer against potential legal and administrative chaos in this scenario is only that Chapman has turned out to be, as Armbruster described him, a "good Samaritan." Efforts to track down Chapman were not successful; Crosscut contacted several Matthew Chapmans who denied being the requester. On January 19th, Seattle's CTO, Michael Mattmiller gave his resignation. Whether his resignation is related to the email leak is hard to say, but I just think the timing makes it worth mentioning. Finally – The Metadata Starting January 26th, Seattle started sending installments of the email metadata I requested. So far they've sent 27 million emails. As of the writing of this post, there are only two departments who haven’t provided their email metadata: the Police Department and Human Services. You can download the raw data here. Some things about the dataset: It’s very messy – triple quotes, semicolons, commas, oh my. There are a millions of systems alerts. For seattle.gov → seattle.gov communication, there are two distinct metadata records. In any case, it's still somewhat workable, so I've been working on a proof of concept for its use in the greater context of public records laws. Not ready to talk much about it yet, so here's is a gephi graph of one day's worth of metadata. Its layout is Yifan Hu and filtered with a k-core minimum of 5 and a minimum degree of 5: Please reach out to me if you'd like to help model these networks. One Last Thing: Legislative Immunity Kerfuffle This last section might not be related, but the timing is interesting, so I feel it’s worth mentioning. On February 23 - between the first installment of email metadata and the second - WA’s legislature attempted to pass SB6617, a bill which removes requirements for disclosure of many of their records – including email exchanges - from WA’s public records laws. What’s particularly interesting about this events of this bill is that it took less than 24 hours from the time it was read for the first time to the time that it passed at both the House and Senate and sent to the Governor’s office. Seattle Times wrote a good article about it. Thankfully, after the WA governor’s office received over 6,300 phone calls, 100 letters, and over 12,500 emails, the governor ended up vetoing the bill. Neat. It's hard to say if that caused any sort of delay, but after a month and a half of waiting: How are the installments looking? I saw that there was some recent legislative immunity kerfuffle around emails. Is that related to any delays? And got this response: Good news.  The recent Washington state legislative immunity kerfuffle will not impact your installments.  We have fixed the bug that was impacting our progress and are now on our way. In fact, I'll have more records for you this week. A month later, they started sending the rest. What’s Next? The work done throughout this post has led to a massive trove of information that ought to be enormously useful in understanding the dynamics of one the US's biggest cities. A big hope in making this sort of information available to the public is that it will help in changing the dynamic of understanding what sorts of information is accessible. That said, this is just one city of many which have given me email metadata. As more of it comes through, I’ll be able to map out more and more, but the difficulty in requesting those records continues to get in the way. Once I get some of these bigger stories out of the way, I’ll start writing fewer stories and write more about public records requesting fundamentals – particularly for digital records. Next post will be about my ongoing suit against the White House OMB for email metadata from January 2017. This past Wednesday was the first court date - where the defendent's counsel never showed up. Hope you enjoyed! Tags: seattle, foia, kerfuffle, metadata

over a year ago 17 votes
Using FOIA Data and Unix to halve major source of parking tickets

Intro This'll be my first blog post on the internet, ever. Hopefully it's interesting and accurate. Please point out any mistakes if you see any! In 2016, I did some work in trying to find some hotspot areas for parking tickets to see if a bit of data munging could reduce those area's parking tickets. In the end, I only really got one cleaned up, but it was one of the most-ticketed spots in all of Chicago and led to about a 50% reduction in parking tickets. Here's a bit of that story. Getting the data through FOIA: The system Chicago uses to store its parking tickets is called CANVAS. It's short for "City of Chicago Violation, Noticing and Adjudication Business Process And System Support" [sic] and managed by IBM. Its most recent contract started in 2012, expires in 2022, and has a pricetag of over $190 million. Most of Chicago's contracts and their Requests For Procurement (RFP) PDFs are published online. In CANVAS's contracts it gives a fair amount of info on CANVAS's backend infrastructure, including the fact that it uses Oracle 10g. In other words, a FOIA request can be fulfilled by IBM running some simple SQL. CANVAS Technical Spec CANVAS Contract CANVAS Request For Procurement (RFP) and Contract With that information at hand, and a couple failed FOIA requests later, I sent this request to get the parking ticket data from Jan 1, 2009 to Mar 10, 2016: "Please provide to me all possible information on all parking tickets between 2009 and the present day. This should include any information related to the car (make, etc), license plate, ticket, ticketer, ticket reason(s), financial information (paid, etc), court information (contested, etc), situational (eg, time, location), and photos/videos. Ideally, this should also include any relevant ticket-related information stored within CANVAS. [...] This information will be used for data analysis for research. Thanks in advance, Matt Chapman" Data About a month later, a guy named Carl (in a fancy suit), handed me a CD with some extremely messy data in a semicolon delimited file named A50462_TcktsIssdSince2009.txt. The file had info on 17,806,818 parking tickets and spans from Jan 1, 2009 to Mar 10, 2016. The data itself looks like this: head -5 A50462_TcktsIssdSince2009.txt Ticket Number;License Plate Number;License Plate State;License Plate Type;Ticket Make;Issue Date;Violation Location;Violation Code;Violation Description;Badge;Unit;Ticket Queue;Hearing Dispo 39596087;zzzzzz;IL;PAS;VOLV;03/03/2003 11:25 am;3849 W CONGRESS;0976160F;EXPIRED PLATES OR TEMPORARY REGISTRATION;11870;701;Paid; 40228076;zzzzzz;IL;TRK;FORD;03/01/2003 12:29 am;3448 N OKETO;0964170A;TRUCK,RV,BUS, OR TAXI RESIDENTIAL STREET;17488;016;Define; 40480875;zzzzzz;IL;PAS;PONT;03/01/2003 09:45 pm;8135 S PERRY;0964130;PARK OR BLOCK ALLEY;17575;006;Notice; 40718783;zzzzzz;IL;PAS;ISU;03/02/2003 06:02 pm;6928 S CORNELL;0976160F;EXPIRED PLATES OR TEMPORARY REGISTRATION;7296;003;Paid;Liable Some things to note about the data: The file is semicolon delimited. Each address is hand typed on a handheld device, often with gloves. There are millions of typos in the address column. Including over 50,000 semicolons! There is no lat/lon. What that amounts to is an extremely, extremely messy and unpredictable dataset that's incredibly to difficult to accurately map to lat/lon, which is needed for any sort of comprehensive GIS analysis. There are a bunch of geocoder services that can help out here, but most of them have about a 50% accuracy rate. That said, with the help of a bit of scrubbing, that number can be boosted to closer to 90%. Another post for another time. Here’s a sample list of Lake Shore Drive typos: Laks Shore Dr Lawkeshore Dr West Lkaeshore Dr Lkae Shore Dr Lkae Shore Drive Lkaeshore Dr West Original Analysis I was particularly interested in finding areas that had hotspot areas that stood out. A lot of my time was spent just throwing hacky code at the problem and eventually wrote out two series of (hacky) commands that led to identifying a potentially fixable spot. Originally, the work and analysis I was doing was with a combination of unix commands and gnuplot. Since then, I've migrated my code to python + matplotlib + SQL. But, for the sake of this blog, I wanted to show the original analysis. Get count of tickets at addresses and ignoring first two digits: $ mawk -F';' '{print $7}' all_tickets.orig.txt | sed -r 's/^([0-9]*)[0-9][0-9] (.*)/\100 \2/' | sed -r 's/ (BLVD|ST|AV|AVE|RD)$//' | sort | uniq -c | sort -nr 79320 1900 W OGDEN 60059 1100 N STATE 50594 100 N WABASH 44503 1400 N MILWAUKEE 43121 1500 N MILWAUKEE 43030 2800 N BROADWAY 42294 2100 S ARCHER 42116 1900 W HARRISON Get count of tickets, by ticket type, at specific addresses: $ mawk -F';' '{print $9,$7}' A50462_TcktsIssdSince2009.txt | sed -r 's/ (BLVD|ST|AV|AVE|RD)$//' | sort --parallel=4 | uniq -c | sort -nr 12510 EXPIRED PLATES OR TEMPORARY REGISTRATION 5050 W 55TH 9636 PARKING/STANDING PROHIBITED ANYTIME 835 N MICHIGAN 8943 EXPIRED PLATES OR TEMPORARY REGISTRATION 1 W PARKING LOT A 6168 EXPIRED PLATES OR TEMPORARY REGISTRATION 1 W PARKING LOT E 5938 PARKING/STANDING PROHIBITED ANYTIME 500 W MADISON 5663 PARK OR STAND IN BUS/TAXI/CARRIAGE STAND 1166 N STATE 5527 EXPIRED METER OR OVERSTAY 5230 S LAKE PARK 4174 PARKING/STANDING PROHIBITED ANYTIME 1901 W HARRISON 4137 REAR AND FRONT PLATE REQUIRED 1 W PARKING LOT A Both bits of code roughly show that there's something going on at 1100N state street, and 1166 N State St looks particularly suspicious.. So, have a look at the original set of signs: Things going on that make this spot confusing: This is a taxi stand from 7pm to 5am for three cars’ lengths. Parking in a taxi stand is a $100 ticket. When this spot isn’t a taxi stand, it’s metered parking – for a parking meter beyond an alleyway. It’s possible to pay for parking here after 7pm, which makes it look like parking is acceptable – especially with the “ParkChicago” sign floating there. Confusion creates more confusion – if one car parks there, then more cars follow. Cha-ching. Contacting the 2nd Ward With all that in mind, I contacted the second ward’s alderman’s office on April 12 explaining this, and got back this response: "Hello Matt, […]The signs and the ordinance are currently being investigated. In the interim, I highly recommend that you do not park there to avoid any further tickets. Lisa Ryan Alderman Brian Hopkins - 2nd Ward" The Fix On 1/11/17 I received this email from Lisa: "Matt, I don't know if you noticed the additional signage installed on State Street at the 3 Taxi Stand. This should elevate [sic] any further confusion of vehicle parking. Thank you for your patience. Lisa Ryan" Sure enough, two new signs were added! The new taxi stand sign sets a boundary for a previously unbounded taxi stand. The No Parking sign explicitly makes it clear that parking here during taxi stand hours is a fineable offense. Neat! And then there's this guy: Results I recently decided to look at the number of tickets at that spot. Armed with a new set of data from another FOIA request, I did some analysis with python, pandas, and SQL. What I found is that the addition of a new sign effectively led to a 50% reduction in parking tickets between 1150 and 1200 N State St. Adding it all up, that's about 400 tickets fewer in 2017 and 200 so far in 2018 compared to 2016. All in all, that's about $60,000 worth in parking tickets! The drop in slope to about 50% matches perfectly with Lisa’s email: And then comparing 2016 to 2017: What's next? All in all, the number of parking tickets is going up in Chicago, and this work shows that something can be done, even if small. This work is only on one small section of road, but I'm convinced that similar work can be on a systematic scale. It's mostly just a matter of digging through data and working directly with each ward. The later analysis done here was also only done on the most recent dataset that the Department of Revenue they gave me. The two datasets have a different set of columns, so the two datasets need to be combined still. I hope to accomplish that soon! Analysis Code Data used in this blog. Tags: FOIA, parkingtickets, civics, unix, python

over a year ago 13 votes
A tale about requesting Chicago’s Mayor’s Office’s phone records.

Intro Back in 2014, I had the naive goal of finding evidence of collusion between mayoral candidates. The reasoning is longwinded and boring, so I won't go into it. My plan was to find some sort of evidence through a FOIA request or two for the mayor's phone records, find zero evidence of collusion, then move onto a different project like I normally do. What came instead was a painful year and a half struggle to get a single week's worth of phone records from Chicago's Office of the Mayor. Hope you enjoy and learn something along the way! Requests to Mayor's Office My first request was simple and assumed that the mayor's office had a modern phone. On Dec 8, 2014, I sent this anonymous FOIA request to Chicago's Office of the Mayor: Please attach all of the mayor's phone records from any city-owned phones (including cellular phones) over the past 4 years. Ten days later, I received a rejection back stating they didn't have any of the mayor's phone records: A FOIA request must be directed to the department that maintains the records you are seeking. The Mayor’s Office does not have any documents responsive to your request. Then, to test testing whether it was just the mayor whose records their office didn't maintain, I sent another request to the Office of the Mayor - this time specifically for the FOIA officer's phone records and got the same response. Maybe another department has the records? VoIP Logs Request An outstanding question I had (and still have, to some extent) is whether or not server logs are accessible through FOIA. So, to kill two birds with one stone, I sent a request for VoIP server logs to Chicago's Department of Innovation and Technology (DoIT): Please attach in a standard text, compressed format, all VoIP server logs that would contain phone numbers dialed between the dates of 11/24/14 and 12/04/14 for [the mayor's phone]. Ten days later (and five days late), I received a response that my request was being reviewed. Because they were late to respond, IL FOIA says that they can no longer reject my request if it's unduly burdensome - one of the more interesting statutory pieces of IL FOIA. The phone... records? A month goes by, and they send back a two page PDF with phone numbers whose last four digits are redacted: ...along with a two and a half page letter explaining why. I really encourage you to read it. tl;dr of their response and its records: They claim that the review/redaction process would be extremely unduly burdensome - even though they were 5 days late! The pdf includes 83 separate phone calls, with 45 unique phone numbers. The last four digit of each phone number is removed. Government issued cell phones’ numbers have been removed completely for privacy reasons. Private phone numbers aren’t being redacted to the same extent as government cell phones. Government desk phones are redacted. Their response is particularly strange, because IL FOIA says: "disclosure of information that bears on the public duties of public employees and officials shall not be considered an invasion of personal privacy." With the help of my lawyer, I sent an email to Chicago explaining this... and never received a response. Time to appeal! Request for Review In many IL FOIA rejections, the following text is written at the bottom: You have a right of review of this denial by the IL Attorney General's Public Access Counselor, who can be contacted at 500 S. Second St., Springfield, IL 62706 or by telephone at (217)(558)-0486. You may also seek judicial review of a denial under 5 ILCS 140/11 of FOIA. I went the first route by submitting a Request for Review (RFR). The RFR letter can be boiled down to: They stopped responding. Redaction favors the government personnel’s privacy over individuals’, despite FOIA statute. Their response to the original request took ten days. Seven Months Later Turns out RFRs are very, very slow. So - seven months later, I received a RFR closing letter with a non-binding opinion saying that Chicago should send the records I requested. Their reason mostly boils down to Chicago not giving sufficient reason to call the request unduly burdensome. A long month later - August 11, 2015: - Chicago responds with this, saying that my original request was for VoIP server logs, which Chicago doesn’t have: [H]e requested "VoIP server logs," which the Department has established it does not possess. As a result, the City respectfully disagrees with your direction to produce records showing telephone numbers, as there is not an outstanding FOIA request for responsive records in the possession of the Department. Sure enough, his phone is pretty ancient: Image Source Do Over And so after nine months of what felt like wasted effort, I submitted another request that same day: Please attach [...] phone numbers dialed between the dates of 11/24/14 and 12/04/14 for [the Office of the Mayor] Two weeks later - this time with an on-time extension letter - I'm sent another file that looks like this: The exact same file. They even sent the same rejection reasons! Lawsuit On 12/2/2015, Loevy & Loevy filed suit against Chicago’s DoIT. The summary of the complaint is that we disagree with their claim that to review and redact the phone records would be extremely unduly burdensome. My part in this was waiting while my lawyer did all the work. I wasn't really involved in this part, so there's really not much for me to write about. Lawsuit conclusion Six months later, on May 11, 2016, the city settled and gave me four pages of phone logs - most of which were still redacted. Some battles, eh? Interesting bits from the court document: ...DoIT and its counsel became aware that, in its August 24, 2015 response, DoIT had inadvertently misidentified the universe of responsive numbers. DoIT identified approximately 130 additional phone numbers dialed from the phones dialed within Suite 507 of City Hall, bringing the total to 171 ...FOIA only compels the production of listed numbers belonging to businesses, governmental agencies and other entities, and only those numbers which are not work-issued cell phones. ...DoIT asserted that compliance with plaintiff request was overly burdensome pursuant to Section 3(g) of FOIA. On those grounds —rather than provide no numbers at all — DoIT redacted the last four digits of all phone numbers provided ...in other words, they "googled" each number to determine whether that number was publically listed, and, if so, to whom it belonged. This resulted in the identification of 57 out of the 137 responsive numbers... Lawsuit Records All in all, the phone records contained: 171 unique phone log entries: 57 unredacted and 114 redacted. 32 unique unredacted phone numbers. 44 unique redacted phone numbers. From there, there really wasn't much to work with. Most of the phone calls were day-to-day calls to places like flower shops, doctors and restaurants. Still, some numbers are interesting: A four-hour hotel: Prestige Club: Aura Investigative services: Statewide Investigative Services and Kennealy & O'Callaghanh Michael Madigan Data: Lawsuit Records Going deeper With all of that done – a year and a half in total for one request - I wasn’t feeling satisfied and dug deeper. This time, I started approaching it methodically to build a toolchain of sorts. So, to determine whether the same length of time could be requested without another lawsuit: Please provide me with the to/from telephone numbers, duration, time and date of all calls dialed from 121 N La Salle St #507, Chicago, IL 60602 for the below dates. April 6-9, 2015 November 23-25, 2015 And sure enough, two weeks later, I received two pdfs with phone records – this time with times, dates, from number and call length! Much faster now! Still, it’s lame that they’re still redacting a lot, and there wasn't anything interesting in these records. Data: Long Distance, Local Full year of records How about for a full year for a small set of previously released phone numbers? Please provide to me, for the year of 2014, the datetime and dialed-from number for the below numbers from the [Office of the Mayor] (312) 942-1222 [Statewide Investigative Services] (505) 747-4912 [Azura Investigations] (708) 272-6000 [Aura - Prestige Club] (773) 783-4855 [Kennealy & O'Callaghanh] (312) 606-9999 [Siam Rice] (312) 553-5698 [Some guy named Norman Kwong] Again, success! Siam Rice: 55 calls! Statewide Investigative Services: 8 calls Keannealy & O'Callaghanh: 10 calls Some guy named Norman Kwong: 3 calls (Interestingly enough, they didn't give me the prestige club phone numbers. Heh!) Data: Full year records Full year of records - City hall And finally, another request for previously released numbers – across all of city hall in 2014 and 2015: The phone numbers, names and call times to and from the phone numbers listed below during 2014 and 2015 [within city hall] (312) 942-1222 [Statewide Investigative Services] (773) 783-4855 [Kennealy & O'Callaghanh] (312) 346-4321 [Madigan & Getzendanner] (773) 581-8000 [Michael Madigan] (708) 272-6000 [Aura - Prestige Club] Data: City hall full year This means that a few methods of retrieving phone records are possible: A week's worth of (mostly redacted) records from a high profile office. A phone records through an office as big as the City Hall. The use of requesting unredacted phone numbers for future requests. Phone directory woes Of the 27 distinct Chicago phone numbers found within the last request’s records, only five of them could be resolved to a phone number found in Chicago’s phone directory: DEAL, AARON J KLINZMAN, GRANT T EMANUEL, RAHM NELSON, ASHLI RENEE MAGANA, JASMINE M This is a problem that I haven't solved for yet, but it should be easy enough by requesting a full phone directory from Chicago's DoIT. Anyone up for that challenge? ;) Emails? This probably deserves its own blog post, but I wanted tease it a bit, because it leads into other posts. I sent this request with the presumption that the redaction of emails would take a very long time: From all emails sent from [the Office of the Mayor] between 11/24/14 and 12/04/14, please provide me to all domain names for all email addresses in the to/cc/bcc. From each email, include the sender's address and sent times. Two months later, I received a 1,751 page document with full email addresses for to, from cc, and bcc, including the times of 18,860 separate emails to and from the mayor’s office. Neat - it only took about a year and a half to figure out how to parse the damn thing, though.... Interestingly, the mayor's email address isn't in there that often... Data: Email Metadata What's next? This whole process was a complete and total pain. The usefulness of knowing the ongoings of our government - especially at its highest levels - are critical for ensuring that our government is open and honest. It really shouldn't have been this difficult, but it was. The difficulties led me down an interesting path of doing many similar requests - and boy are there stories. Next post: The time Seattle accidentally sent me 30m emails for ~$30. Code Scraping Chicago's phone directory Parsing 1,700 page email pdf Tags: foia, phone, email, data, chicago

over a year ago 16 votes
over a year ago 20 votes

More in programming

ChatGPT Would be a Decent Policy Advisor

Revealed: How the UK tech secretary uses ChatGPT for policy advice by Chris Stokel-Walker for the New Scientist

11 hours ago 3 votes
Setting policy for strategy.

This book’s introduction started by defining strategy as “making decisions.” Then we dug into exploration, diagnosis, and refinement: three chapters where you could argue that we didn’t decide anything at all. Clarifying the problem to be solved is the prerequisite of effective decision making, but eventually decisions do have to be made. Here in this chapter on policy, and the following chapter on operations, we finally start to actually make some decisions. In this chapter, we’ll dig into: How we define policy, and how setting policy differs from operating policy as discussed in the next chapter The structured steps for setting policy How many policies should you set? Is it preferable to have one policy, many policies, or does it not matter much either way? Recurring kinds of policies that appear frequently in strategies Why it’s valuable to be intentional about your strategy’s altitude, and how engineers and executives generally maintain different altitudes in their strategies Criteria to use for evaluating whether your policies are likely to be impactful How to develop novel policies, and why it’s rare Why having multiple bundles of alternative policies is generally a phase in strategy development that indicates a gap in your diagnosis How policies that ignore constraints sound inspirational, but accomplish little Dealing with ambiguity and uncertainty created by missing strategies from cross-functional stakeholders By the end, you’ll be ready to evaluate why an existing strategy’s policies are struggling to make an impact, and to start iterating on policies for strategy of your own. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. What is policy? Policy is interpreting your diagnosis into a concrete plan. That plan will be a collection of decisions, tradeoffs, and approaches. They’ll range from coding practices, to hiring mandates, to architectural decisions, to guidance about how choices are made within your organization. An effective policy solves the entirety of the strategy’s diagnosis, although the diagnosis itself is encouraged to specify which aspects can be ignored. For example, the strategy for working with private equity ownership acknowledges in its diagnosis that they don’t have clear guidance on what kind of reduction to expect: Based on general practice, it seems likely that our new Private Equity ownership will expect us to reduce R&D headcount costs through a reduction. However, we don’t have any concrete details to make a structured decision on this, and our approach would vary significantly depending on the size of the reduction. Faced with that uncertainty, the policy simply acknowledges the ambiguity and commits to reconsider when more information becomes available: We believe our new ownership will provide a specific target for Research and Development (R&D) operating expenses during the upcoming financial year planning. We will revise these policies again once we have explicit targets, and will delay planning around reductions until we have those numbers to avoid running two overlapping processes. There are two frequent points of confusion when creating policies that are worth addressing directly: Policy is a subset of strategy, rather than the entirety of strategy, because policy is only meaningful in the context of the strategy’s diagnosis. For example, the “N-1 backfill policy” makes sense in the context of new, private equity ownership. The policy wouldn’t work well in a rapidly expanding organization. Any strategy without a policy is useless, but you’ll also find policies without context aren’t worth much either. This is particularly unfortunate, because so often strategies are communicated without those critical sections. Policy describes how tradeoffs should be made, but it doesn’t verify how the tradeoffs are actually being made in practice. The next chapter on operations covers how to inspect an organization’s behavior to ensure policies are followed. When reworking a strategy to be more readable, it often makes sense to merge policy and operation sections together. However, when drafting strategy it’s valuable to keep them separate. Yes, you might use a weekly meeting to review whether the policy is being followed, but whether it’s an effective policy is independent of having such a meeting, and what operational mechanisms you use will vary depending on the number of policies you intend to implement. With this definition in mind, now we can move onto the more interesting discussion of how to set policy. How to set policy Every part of writing a strategy feels hard when you’re doing it, but I personally find that writing policy either feels uncomfortably easy or painfully challenging. It’s never a happy medium. Fortunately, the exploration and diagnosis usually come together to make writing your policy simple: although sometimes that simple conclusion may be a difficult one to swallow. The steps I follow to write a strategy’s policy are: Review diagnosis to ensure it captures the most important themes. It doesn’t need to be perfect, but it shouldn’t have omissions so obvious that you can immediately identify them. Select policies that address the diagnosis. Explicitly match each policy to one or more diagnoses that it addresses. Continue adding policies until every diagnosis is covered. This is a broad instruction, but it’s simpler than it sounds because you’ll typically select from policies identified during your exploration phase. However, there certainly is space to tweak those policies, and to reapply familiar policies to new circumstances. If you do find yourself developing a novel policy, there’s a later section in this chapter, Developing novel policies, that addresses that topic in more detail. Consolidate policies in cases where they overlap or adjoin. For example, two policies about specific teams might be generalized into a policy about all teams in the engineering organization. Backtest policy against recent decisions you’ve made. This is particularly effective if you maintain a decision log in your organization. Mine for conflict once again, much as you did in developing your diagnosis. Emphasize feedback from teams and individuals with a different perspective than your own, but don’t wholly eliminate those that you agree with. Just as it’s easy to crowd out opposing views in diagnosis if you don’t solicit their input, it’s possible to accidentally crowd out your own perspective if you anchor too much on others’ perspectives. Consider refinement if you finish writing, and you just aren’t sure your approach works – that’s fine! Return to the refinement phase by deploying one of the refinement techniques to increase your conviction. Remember that we talk about strategy like it’s done in one pass, but almost all real strategy takes many refinement passes. The steps of writing policy are relatively pedestrian, largely because you’ve done so much of the work already in the exploration, diagnosis, and refinement steps. If you skip those phases, you’d likely follow the above steps for writing policy, but the expected quality of the policy itself would be far lower. How many policies? Addressing the entirety of the diagnosis is often complex, which is why most strategies feature a set of policies rather than just one. The strategy for decomposing a monolithic application is not one policy deciding not to decompose, but a series of four policies: Business units should always operate in their own code repository and monolith. New integrations across business unit monoliths should be done using gRPC. Except for new business unit monoliths, we don’t allow new services. Merge existing services into business-unit monoliths where you can. Four isn’t universally the right number either. It’s simply the number that was required to solve that strategy’s diagnosis. With an excellent diagnosis, your policies will often feel inevitable, and perhaps even boring. That’s great: what makes a policy good is that it’s effective, not that it’s novel or inspiring. Kinds of policies While there are so many policies you can write, I’ve found they generally fall into one of four major categories: approvals, allocations, direction, and guidance. This section introduces those categories. Approvals define the process for making a recurring decision. This might require invoking an architecture advice process, or it might require involving an authority figure like an executive. In the Index post-acquisition integration strategy, there were a number of complex decisions to be made, and the approval mechanism was: Escalations come to paired leads: given our limited shared context across teams, all escalations must come to both Stripe’s Head of Traffic Engineering and Index’s Head of Engineering. This allowed the acquired and acquiring teams to start building trust between each other by ensuring both were consulted before any decision was finalized. On the other hand, the user data access strategy’s approval strategy was more focused on managing corporate risk: Exceptions must be granted in writing by CISO. While our overarching Engineering Strategy states that we follow an advisory architecture process as described in Facilitating Software Architecture, the customer data access policy is an exception and must be explicitly approved, with documentation, by the CISO. Start that process in the #ciso channel. These two different approval processes had different goals, so they made tradeoffs differently. There are so many ways to tweak approval, allowing for many different tradeoffs between safety, productivity, and trust. Allocations describe how resources are split across multiple potential investments. Allocations are the most concrete statement of organizational priority, and also articulate the organization’s belief about how productivity happens in teams. Some companies believe you go fast by swarming more people onto critical problems. Other companies believe you go fast by forcing teams to solve problems without additional headcount. Both can work, and teach you something important about the company’s beliefs. The strategy on Uber’s service migration has two concrete examples of allocation policies. The first describes the Infrastructure engineering team’s allocation between manual provision tasks and investing into creating a self-service provisioning platform: Constrain manual provisioning allocation to maximize investment in self-service provisioning. The service provisioning team will maintain a fixed allocation of one full time engineer on manual service provisioning tasks. We will move the remaining engineers to work on automation to speed up future service provisioning. This will degrade manual provisioning in the short term, but the alternative is permanently degrading provisioning by the influx of new service requests from newly hired product engineers. The second allocation policy is implicitly noted in this strategy’s diagnosis, where it describes the allocation policy in the Engineering organization’s higher altitude strategy: Within infrastructure engineering, there is a team of four engineers responsible for service provisioning today. While our organization is growing at a similar rate as product engineering, none of that additional headcount is being allocated directly to the team working on service provisioning. We do not anticipate this changing. Allocation policies often create a surprising amount of clarity for the team, and I include them in almost every policy I write either explicitly, or implicitly in a higher altitude strategy. Direction provides explicit instruction on how a decision must be made. This is the right tool when you know where you want to go, and exactly the way that you want to get there. Direction is appropriate for problems you understand clearly, and you value consistency more than empowering individual judgment. Direction works well when you need an unambiguous policy that doesn’t leave room for interpretation. For example, Calm’s policy for working in the monolith: We write all code in the monolith. It has been ambiguous if new code (especially new application code) should be written in our JavaScript monolith, or if all new code must be written in a new service outside of the monolith. This is no longer ambiguous: all new code must be written in the monolith. In the rare case that there is a functional requirement that makes writing in the monolith implausible, then you should seek an exception as described below. In that case, the team couldn’t agree on what should go into the monolith. Individuals would often make incompatible decisions, so creating consistency required removing personal judgment from the equation. Sometimes judgment is the issue, and sometimes consistency is difficult due to misaligned incentives. A good example of this comes in strategy on working with new Private Equity ownership: We will move to an “N-1” backfill policy, where departures are backfilled with a less senior level. We will also institute a strict maximum of one Principal Engineer per business unit. It’s likely that hiring managers would simply ignore this backfill policy if it was stated more softly, although sometimes less forceful policies are useful. Guidance provides a recommendation about how a decision should be made. Guidance is useful when there’s enough nuance, ambiguity, or complexity that you can explain the desired destination, but you can’t mandate the path to reaching it. One example of guidance comes from the Index acquisition integration strategy: Minimize changes to tokenization environment: because point-of-sale devices directly work with customer payment details, the API that directly supports the point-of-sale device must live within our secured environment where payment details are stored. However, any other functionality must not be added to our tokenization environment. This might read like direction, but it’s clarifying the desired outcome of avoiding unnecessary complexity in the tokenization environment. However, it’s not able to articulate what complexity is necessary, so ultimately it’s guidance because it requires significant judgment to interpret. A second example of guidance comes in the strategy on decomposing a monolithic codebase: Merge existing services into business-unit monoliths where you can. We believe that each choice to move existing services back into a monolith should be made “in the details” rather than from a top-down strategy perspective. Consequently, we generally encourage teams to wind down their existing services outside of their business unit’s monolith, but defer to teams to make the right decision for their local context. This is another case of knowing the desired outcome, but encountering too much uncertainty to direct the team on how to get there. If you ask five engineers about whether it’s possible to merge a given service back into a monolithic codebase, they’ll probably disagree. That’s fine, and highlights the value of guidance: it makes it possible to make incremental progress in areas where more concrete direction would cause confusion. When you’re working on a strategy’s policy section, it’s important to consider all of these categories. Which feel most natural to use will vary depending on your team and role, but they’re all usable: If you’re a developer productivity team, you might have to lean heavily on guidance in your policies and increased support for that guidance within the details of your platform. If you’re an executive, you might lean heavily on direction. Indeed, you might lean too heavily on direction, where guidance often works better for areas where you understand the direction but not the path. If you’re a product engineering organization, you might have to narrow the scope of your direction to the engineers within that organization to deal with the realities of complex cross-organization dynamics. Finally, if you have a clear approach you want to take that doesn’t fit cleanly into any of these categories, then don’t let this framework dissuade you. Give it a try, and adapt if it doesn’t initially work out. Maintaining strategy altitude The chapter on when to write engineering strategy introduced the concept of strategy altitude, which is being deliberate about where certain kinds of policies are created within your organization. Without repeating that section in its entirety, it’s particularly relevant when you set policy to consider how your new policies eliminate flexibility within your organization. Consider these two somewhat opposing strategies: Stripe’s Sorbet strategy only worked in an organization that enforced the use of a single programming language across (essentially) all teams Uber’s service migration strategy worked well in an organization that was unwilling to enforce consistent programming language adoption across teams Stripe’s organization-altitude policy took away the freedom of individual teams to select their preferred technology stack. In return, they unlocked the ability to centralize investment in a powerful way. Uber went the opposite way, unlocking the ability of teams to pick their preferred technology stack, while significantly reducing their centralized teams’ leverage. Both altitudes make sense. Both have consequences. Criteria for effective policies In The Engineering Executive’s Primer’s chapter on engineering strategy, I introduced three criteria for evaluating policies. They ought to be applicable, enforced, and create leverage. Defining those a bit: Applicable: it can be used to navigate complex, real scenarios, particularly when making tradeoffs. Enforced: teams will be held accountable for following the guiding policy. Create Leverage: create compounding or multiplicative impact. The last of these three, create leverage, made sense in the context of a book about engineering executives, but probably doesn’t make as much sense here. Some policies certainly should create leverage (e.g. empower developer experience team by restricting new services), but others might not (e.g. moving to an N-1 backfill policy). Outside the executive context, what’s important isn’t necessarily creating leverage, but that a policy solves for part of the diagnosis. That leaves the other two–being applicable and enforced–both of which are necessary for a policy to actually address the diagnosis. Any policy which you can’t determine how to apply, or aren’t willing to enforce, simply won’t be useful. Let’s apply these criteria to a handful of potential policies. First let’s think about policies we might write to improve the talent density of our engineering team: “We only hire world-class engineers.” This isn’t applicable, because it’s unclear what a world-class engineer means. Because there’s no mutually agreeable definition in this policy, it’s also not consistently enforceable. “We only hire engineers that get at least one ‘strong yes’ in scorecards.” This is applicable, because there’s a clear definition. This is enforceable, depending on the willingness of the organization to reject seemingly good candidates who don’t happen to get a strong yes. Next, let’s think about a policy regarding code reuse within a codebase: “We follow a strict Don’t Repeat Yourself policy in our codebase.” There’s room for debate within a team about whether two pieces of code are truly duplicative, but this is generally applicable. Because there’s room for debate, it’s a very context specific determination to decide how to enforce a decision. “Code authors are responsible for determining if their contributions violate Don’t Repeat Yourself, and rewriting them if they do.” This is much more applicable, because now there’s only a single person’s judgment to assess the potential repetition. In some ways, this policy is also more enforceable, because there’s no longer any ambiguity around who is deciding whether a piece of code is a repetition. The challenge is that enforceability now depends on one individual, and making this policy effective will require holding individuals accountable for the quality of their judgement. An organization that’s unwilling to distinguish between good and bad judgment won’t get any value out of the policy. This is a good example of how a good policy in one organization might become a poor policy in another. If you ever find yourself wanting to include a policy that for some reason either can’t be applied or can’t be enforced, stop to ask yourself what you’re trying to accomplish and ponder if there’s a different policy that might be better suited to that goal. Developing novel policies My experience is that there are vanishingly few truly novel policies to write. There’s almost always someone else has already done something similar to your intended approach. Calm’s engineering strategy is such a case: the details are particular to the company, but the general approach is common across the industry. The most likely place to find truly novel policies is during the adoption phase of a new widespread technology, such as the rise of ubiquitous mobile phones, cloud computing, or large language models. Even then, as explored in the strategy for adopting large-language models, the new technology can be engaged with as a generic technology: Develop an LLM-backed process for reactivating departed and suspended drivers in mature markets. Through modeling our driver lifecycle, we determined that improving onboarding time will have little impact on the total number of active drivers. Instead, we are focusing on mechanisms to reactivate departed and suspended drivers, which is the only opportunity to meaningfully impact active drivers. You could simply replace “LLM” with “data-driven” and it would be equally readable. In this way, policy can generally sidestep areas of uncertainty by being a bit abstract. This avoids being overly specific about topics you simply don’t know much about. However, even if your policy isn’t novel to the industry, it might still be novel to you or your organization. The steps that I’ve found useful to debug novel policies are the same steps as running a condensed version of the strategy process, with a focus on exploration and refinement: Collect a number of similar policies, with a focus on how those policies differ from the policy you are creating Create a systems model to articulate how this policy will work, and also how it will differ from the similar policies you’re considering Run a strategy testing cycle for your proto-policy to discover any unknown-unknowns about how it works in practice Whether you run into this scenario is largely a function of the extent of your, and your organization’s, experience. Early in my career, I found myself doing novel (for me) strategy work very frequently, and these days I rarely find myself doing novel work, instead focusing on adaptation of well-known policies to new circumstances. Are competing policy proposals an anti-pattern? When creating policy, you’ll often have to engage with the question of whether you should develop one preferred policy or a series of potential strategies to pick from. Developing these is a useful stage of setting policy, but rather than helping you refine your policy, I’d encourage you to think of this as exposing gaps in your diagnosis. For example, when Stripe developed the Sorbet ruby-typing tooling, there was debate between two policies: Should we build a ruby-typing tool to allow a centralized team to gradually migrate the company to a typed codebase? Should we migrate the codebase to a preexisting strongly typed language like Golang or Java? These were, initially, equally valid hypotheses. It was only by clarifying our diagnosis around resourcing that it became clear that incurring the bulk of costs in a centralized team was clearly preferable to spreading the costs across many teams. Specifically, recognizing that we wanted to prioritize short-term product engineering velocity, even if it led to a longer migration overall. If you do develop multiple policy options, I encourage you to move the alternatives into an appendix rather than including them in the core of your strategy document. This will make it easier for readers of your final version to understand how to follow your policies, and they are the most important long-term user of your written strategy. Recognizing constraints A similar problem to competing solutions is developing a policy that you cannot possibly fund. It’s easy to get enamored with policies that you can’t meaningfully enforce, but that’s bad policy, even if it would work in an alternate universe where it was possible to enforce or resource it. To consider a few examples: The strategy for controlling access to user data might have proposed requiring manual approval by a second party of every access to customer data. However, that would have gone nowhere. Our approach to Uber’s service migration might have required more staffing for the infrastructure engineering team, but we knew that wasn’t going to happen, so it was a meaningless policy proposal to make. The strategy for navigating private equity ownership might have argued that new ownership should not hold engineering accountable to a new standard on spending. But they would have just invalidated that strategy in the next financial planning period. If you find a policy that contemplates an impractical approach, it doesn’t only indicate that the policy is a poor one, it also suggests your policy is missing an important pillar. Rather than debating the policy options, the fastest path to resolution is to align on the diagnosis that would invalidate potential paths forward. In cases where aligning on the diagnosis isn’t possible, for example because you simply don’t understand the possibilities of a new technology as encountered in the strategy for adopting LLMs, then you’ve typically found a valuable opportunity to use strategy refinement to build alignment. Dealing with missing strategies At a recent company offsite, we were debating which policies we might adopt to deal with annual plans that kept getting derailed after less than a month. Someone remarked that this would be much easier if we could get the executive team to commit to a clearer, written strategy about which business units we were prioritizing. They were, of course, right. It would be much easier. Unfortunately, it goes back to the problem we discussed in the diagnosis chapter about reframing blockers into diagnosis. If a strategy from the company or a peer function is missing, the empowering thing to do is to include the absence in your diagnosis and move forward. Sometimes, even when you do this, it’s easy to fall back into the belief that you cannot set a policy because a peer function might set a conflicting policy in the future. Whether you’re an executive or an engineer, you’ll never have the details you want to make the ideal policy. Meaningful leadership requires taking meaningful risks, which is never something that gets comfortable. Summary After working through this chapter, you know how to develop policy, how to assemble policies to solve your diagnosis, and how to avoid a number of the frequent challenges that policy writers encounter. At this point, there’s only one phase of strategy left to dig into, operating the policies you’ve created.

16 hours ago 3 votes
Fast and random sampling in SQLite

I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos. This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I’m happy with the code I settled on, but it took several attempts to get right. Approach #1: ORDER BY RANDOM() My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results: SELECT * FROM photos ORDER BY random() LIMIT 10 This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable. It’s slow because it forces SQLite to generate a value for every row, then sort all the rows, and only then does it apply the limit. SQLite is fast, but there’s only so fast you can sort millions of values. I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I’m selecting: SELECT * FROM photos WHERE id IN ( SELECT id FROM photos ORDER BY RANDOM() LIMIT 10 ) This means SQLite only has to load the rows it’s returning, not every row in the database. This query was over three times faster – about 0.15s – but that’s still slower than I wanted. Approach #2: WHERE rowid > (…) Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach: SELECT * FROM photos WHERE rowid > ( ABS(RANDOM()) % (SELECT max(rowid) FROM photos) ) LIMIT 10 The rowid is a unique identifier that’s used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key. This query works by picking a point between the biggest and smallest rowid values used in the table, then getting the rows with rowids which are higher than that point. If you want to know more, Max’s answer has a more detailed explanation. This query is much faster – around 0.0008s – but I didn’t go this route. The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103, … – which isn’t what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I’d get one photo of an old plane, then nine more photos of other planes. I want more variety! (This behaviour isn’t guaranteed – if you don’t add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that’s pretty common, but that may not be true in all cases. It doesn’t affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.) Approach #3: Select random rowid values outside SQLite Max’s answer was the first time I’d heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less “pure” approach because I’m not doing everything in the database, but I’m happy with that if it gets the result I want. Here’s the procedure I came up with: Create an empty list to store our sample. Find the highest rowid that’s currently in use: sqlite> SELECT MAX(rowid) FROM photos; 1913389 Use a random number generator to pick a rowid between 1 and the highest rowid: >>> import random >>> random.randint(1, max_rowid) 196476 If we’ve already got this rowid, discard it and generate a new one. (The rowid is a signed, 64-bit integer, so the minimum possible value is always 1.) Look for a row with that rowid: SELECT * FROM photos WHERE rowid = 196476 If such a row exists, add it to our sample. If we have enough items in our sample, we’re done. Otherwise, return to step 3 and generate another rowid. If such a row doesn’t exist, return to step 3 and generate another rowid. This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It’s a bit slower, but still plenty fast enough (about 0.001s). This approach is best for tables where the rowid values are mostly contiguous – it would be slower if there are lots of rowids between 1 and the max that don’t exist. If there are large gaps in rowid values, you might try multiple missing entries before finding a valid row, slowing down the query. You might want to try something different, like tracking valid rowid values separately. This is a good fit for my use case, because photos don’t get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist. Summary Here are the four approaches I tried: Approach Performance (for 2M rows) Notes ORDER BY RANDOM() ~0.5s Slowest, easiest to read WHERE id IN (SELECT id …) ~0.15s Faster, still fairly easy to understand WHERE rowid > ... ~0.0008s Returns clustered results Random rowid in Python ~0.001s Fast and returns varied results, requires code outside SQL I’m using the random rowid in Python in the Commons Explorer, trading code complexity for speed. I’m using this random sample to render a web page, so it’s important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load. But I’ve used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don’t care about absolute performance. It’s simpler to read and easier to see what’s going on. Now it’s your turn – visit the Commons Explorer and see what random gems you can find. Let me know if you spot anything cool! [If the formatting of this post looks odd in your feed reader, visit the original article]

7 hours ago 1 votes
Choosing Languages
yesterday 2 votes
05 · Syncing Keyhive

How we sync Keyhive and Automerge

yesterday 1 votes