More from Civic Hax
Intro Hi there! In this post, I want to show off a fun little web app I made for visualizing parking tickets in Chicago, but because I've spent so much time on the overall project, I figured I'd share the story that got me to this point. In many ways, this work is the foundation for my interest in public records and transparency, so it has a very special place in my heart. If you're here just to play around with a cool app, click here. Otherwise, I hope you enjoy this post and find it interesting. Also - please don't hesitate to share harsh criticism or suggestions. Enjoy! Project Beginnings Back in 2014, while on vacation, my car was towed for allegedly being in a construction zone. The cost to get the car out was quoted at around $700 through tow fees and storage fees that increase each day by $35. That's obviously a lot of money, so the night I noticed the car was towed, I began looking for ways to get the ticket thrown out in court. After some digging, I found a dataset on data.cityofchicago.org which details every street closure, including closures for construction. As luck would have it, the address that my car was towed at had zero open construction permits, which seemed like a good reason to throw out the fines. The next day I confirmed the lack of permits by calling Chicago's permit office - they mentioned they only thing found was a canceled construction permit by Comed. That same day, I scheduled a court date for the following day. When I arrived in the court room, I was immediately pulled aside by a city employee and was told that my case was being thrown out. They explained that the ticket itself "didn't have enough information", but besides that I wasn't told much! The judge then signed a form that allowed the release of my car at zero cost. At this point, I'd imagine most folk would just walk away with some frustration, but would be overall happy and move on. For whatever reason, though, I'm too stubborn for that - I couldn't get over the fact that Chicago didn't just spend five minutes checking the ticket before I appeared in court. And what bothered me more was that there isn't some sort of sorts automated check. Surely Chicago has some process in place within its $200 million ticketing system to make sure it's not giving out invalid tickets... Spoiler: it doesn't. The more I thought about it, the more I started wondering why Chicago's court system only favors those who have the time and resources to go to court and started wondering if it was possible to automate these checks myself using historical data. The hunch started based on what I'd seen from the recent "Open Gov" movement in to start making data open to the public by default. When I first started, I mostly just worked with some already publicly available datasets previously released by several news organizations. WBEZ in particular released a fairly large dataset of towing data that, while it was a compelling dataset, it unfortunately ultimately wasn't useful for my little project. So instead, I started working on a similar goal to programatically find invalid tickets in Chicago. That work and its research led to my very first FOIA request, where after some trial and error, received the records for 14m parking tickets. Trial Development and Trial Geocoding With the data at hand, I started throwing many, many unix one liners and gnuplot at the problem. This method worked well enough for small one-off checks, but for anything else, it's a complete mess of sed and awk statements. After unix-by-default I moved on to Python, where I made a small batch of surprisingly powerful, but scraggly scripts. One script, for example, would check to see whether a residential parking permit ticket was valid by comparing a ticket's address against a set of streets and its ranges found on data.cityofchicago.org. The script ended up finding a lot of residential permit tickets given outside of the sign's listed boundaries. Sadly though, none of the findings were relevant, since a ticket is given based on the physical location of the sign, not what the sign's location data represents. These sorts or problems ended up putting a halt to these sorts of scripts. With the next development iteration, I switched to visualizing parking tickets, where I essentially continued until today. Problem was, the original data had no lat/lng and had to be geocoded.. So, back when I first started doing this, there weren't that many geocoding services that were reasonably priced. Google was certainly possible, but its licensing and usage limits made Google a total no-go. The other two options I found were from Equifax (expensive) and the US Post Office (also expensive). So I went down a several hundred hour rabbit hole of attempting to geocode all addresses myself - entirely under the philosophy that all tickets had a story to tell, and anything lower than 90% geocoding was inexecusable. The first attempt at geocoding was decent, but only matched about 30% of tickets. It worked by tokenizing addresses using a wonderful python library called usaddress, then comparing the "string distance" of a street address against a known set of address already paired with lat/long. It got me to a point where I could make my first map visualization using some simple plotting: Jupyter Notebook Still, with only 30% geocoded, I felt that I could do better through many different methods that continued to use string distance - some implementations much fancier than others. Other methods, ahem, used lots of sed and vim, which I'm still not proud of. There were some attempts where I got close to 90% of addresses geocoded, though I ended up throwing each one out in fear that the lack of validation would bite me in the future. Spoiler: it did. Eventually, my life was made much easier thanks to the folks at SmartyStreets, who set me up an account with unlimited geocoding requests. By itself, SmartyStreets was able to geocode around 50% of the addresses, but when combined with some of my string-distance autocorrection, that number shot up to 70%. It wasn't perfect, but it was "good enough" to move on, and didn't really require me to trust my own set corrections and the anxiety that comes with that. Another New Dataset Fast forward about six months - I hadn't gotten much further on visualization, but instead started focusing on my first blog post. A bit around that post, ProPublica published some excellent work on parking tickets and ended up releasing its own tickets datasets to the public, one of which was geocoded. I'm happy to say I had a small part in helping out with, and then started using it for my own work in the hopes of using a common dataset between groups. More Geocoding Fun A high level of ProPublica's geocoding process is described in pretty good detail here, so if you're interested in how the geocoding was done, you should check that out. In brief, the dataset's addresses were geocoded using Geocodio. The geocoding process they used was dramatically simplified by replacing each address's last two digits with zeroes. It worked surprisingly well, and brought the percent of geocoded tickets to 99%, but this method had the unfortunate side effect of reducing mapping accuracy. So, because I have more opinions than I care to admit about geocoding, I had to look into how well the geocoding was done. Geocoding Strangeness After a few checks against the geocoded results, I noticed that a large portion of of the addresses wasn't geocoded correctly - often in strange ways. One example comes from the tendency of geocoders to favor one direction over the other for some streets. As can be seen below, the geocoding done on Michigan and Wells is completely demolished. What’s particularly notable here is that Michigan Ave has zero instances of “S Michigan” swapped to “N Michigan”. N Wells S Wells Total Total Tickets (pre-geocoding) 280,160 127,064 407,224 Unique Addresses (pre-geocoding) 3,254 3,190 6,444 Total Tickets (post-geocoding) 342,020 37,611 379,631 Unique Addresses (post-geocoding) 40 57 97 Direction is swapped 246 54,309 54,555 N Michigan S Michigan Total Total Tickets (pre-geocoding) 102,225 392,391 494,616 Unique Addresses (pre-geocoding) 2,916 16,585 19,501 Total Tickets (post-geocoding) 15,760 509,485 512,401 Unique Addresses (post-geocoding) 136 144 280 Direction is swapped 102,225 0 102,225 Interestingly, the total ticket count for Wells drops by about 30k after being geocoding. This turned out to be from Geocodio strangely renaming ‘200 S Wells’ to ‘200 W Hill St’. Similar also happens with numbered addresses similar to “83rd St”, where a steet name is often renamed to "100th". 100th street's ticket count appears to have six times as many tickets as it actually does because of this. These are just a few examples, but it's probably safe to say it's tip of the iceberg. My gut tells me that the address simplification had a major effect here. I think a simple solution of replacing the last two digits with 01 if odd, and 02 if even - instead of 0s - would do the trick. Quick Geocoding Story A few years ago, I approached the (ex) Chicago Chief of Open Data and asked if he, or someone in Chicago, was interested in a copy of geocoded addresses I’d worked on, since they don’t actually have anything geocoded (heh). He politely declined, and mentioned that since Chicago has its own geocoder, my geocoded addresses weren’t useful. I then asked if Chicago could run the parking ticket addresses through their geocoder – something that would then be requestable through FOIA. His response was, “No, because that would require a for loop”. Hmmmmm. Visualization to Validate Geocoding In one of my stranger attempts to validate the geocoding results, I wanted to see whether it was possible to use the distance a ticketer traveled to check if a particular geocode was botched or not. It ended up being less useful I'd hoped, but out of it, I made some pretty interesting looking visualizations: Left: Ticketer paths every 15 minutes with 24 hours decay (Animated). Right: All ticketer paths for a full month. This validation code, while it wasn't even remotely useful, ended up being the foundation for what I'm calling my final personal attempt to visualize Chicago's parking tickets. Final Version And with that, here is the final version of the parking ticket visualization app. Click the image below to navigate to the app. Bokeh Frontend The frontend for the app uses a python framework named Bokeh. It's an extremely powerful interactive visualization framework that dramatically reduces how much code it takes to make both simple and complex interfaces. It's not without its (many) faults, but in python land, I personally consider it to be the best available. That said, I wish the Bokeh devs would invest a lot more time improving the serverside documentation. There's just way too much guesswork, buggy behavior, and special one-offs required to match Bokeh's non-server functionality. Flask/Pandas Backend The original backend was PostgreSQL along with PostGIS (for GeoJSON creation) and TimeScaleDB (for timeseries charting). Sadly, PostgreSQL started struggling with larger queries, combined with memory exhausting itself quicker than I'd hoped. In the end, I ended up writing a Flask service that runs in PyPy with Pandas as an in-memory data store. This ended up working surprisingly well in both speed and modularity. That said, there’s still a lot more room for speed improvement - especially in caching! If I ever revisit the backend, I'll try giving PostgreSQL another chance, but throw a bunch more optimization its way. Please feel free to check out the application code. Interesting Findings To wrap this post up, I wanted to share some of the things I've found while playing around with the app. Everything below comes from screenshots from the app, so pardon any loss of information. Snow Route Tickets Outside 3AM-7AM If you park your car in a "snow route" between 3AM-7AM, you will get a ticket. Interestingly, between the hours of 8am to 11pm, over 1,000 tickets have been given over the past 15 years - mostly by CPD. Of these, only 6% have been contested – 6 of which the car owner was still liable for some reason. The other 94% should have automatically been thrown out, but Chicago has no systematic way of doing this. Shame. Highest value line = CPD's tickets. Expired Meter Within Central Business District – Outside of CBD. Downtown Chicago has an area that's officially described as the “Central Business District”. Within this area, tickets for expired meters end up costing $65 instead of the normal $50. In the past 13 years, over 52k of these tickets have been given outside of the Central Business District. From those 52k, only 5k tickets were taken to court, and 70% were thrown out. Overall, Chicago made around $725k off the $15 difference. Another instance of where Chicago could have systematically thrown out many tickets, but didn't. Downtown removed to highlight non-central business district areas. City sticker expiration tickets.. in the middle of July? Up until 2015, Chicago's city stickers expired at the end of June. A massive spike of tickets is bound to happen, but what's strange is that the spike happens in the middle of July, not the start of it. At the onset of each spikes, Chicago is making somewhere between $500k and $1m. This is a $200 ticket that has a nonstop effect on Chicago's poorer citizens. For more on this, check out this great ProPublica article. (The two different colors here come from the fact that Chicago changed the ticket description sometime in 2012.) Bike Lane Tickets With the increasing prevalence of bike lanes and the danger from cars parked in them, I was expecting to see a lot of tickets for parking in a bike lane. Unfortunately, that’s not the case. For example, in 2017, only around 3,000 tickets for parking in a bike lane were given. For more on this, click here. Wards Have different Street Cleaning times I found some oddities while searching for street cleaning tickets given at odd hours. Wards 48, 49 and 50 seem to give out parking tickets much later than the rest of Chicago. Then, starting at 6am, Ward 40 starts giving out tickets earlier than the rest of Chicago. Posted signs probably make it relatively clear that there’s going to be street cleaning, but it’s also probably fair to say that this is understandably confusing. Hopefully there’s a good reason these wards decided to be special. What’s Next? That's hard to say. There's a ton of work that needs to be done to address the systemic problems with Chicago's parking, but there just aren't enough people working on the problem. If you're interested in helping out, you should definitely download the data and start playing around with it! Also, if you're in Chicago, you should also check out ChiHackNight on Tuesday nights. Just make sure to poke some of us parking ticket nerds beforehand in #tickets on CHiHackNight's slack. That said, a lot of these issues should really be looked at by Chicago in more depth. The system used to manage parking ticket information - CANVAS - has cost Chicago around $200,000,000, and yet it's very clear that nobody from Chicago is interested in diving into the data. In fact, based on an old FOIA request, Chicago has only done a single spreadsheet's worth of analysis with its parking ticket data. I hope that some day Chicago starts automating some ticket validation workflows, but it will probably take a lot of public advocacy to compel them to do so. Unrelated Notes Recently I filed suit against Chicago after a FOIA request for the columns and table names of CANVAS was was rejected with a network security exemption. If successful, I hope that the information can be used to submit future FOIA requests using the database schema as the foundation. I'm extremely happy with the argument I made, and I plan on writing about it as the case comes to a close. Anywho, if you like this post and want to see more like it, please consider donating to my non-profit, Free Our Info, NFP which recently became a 501(c)(3). Much of the FOIA work from this blog (much of which is unwritten) is being filtered over to the NFP, so donations will allow similar work to continue. A majority of the proceedings will go directly into paying for FOIA litigation. So, the more donations I receive, the more I can hold government agencies accountable! Hope you enjoyed. Tags: tickets, foia, chicago, bokeh, visualization
Background In my last post, I wrote about my adventure of requesting metadata for both phone calls and emails from the City of Chicago Office of the Mayor. The work there - and its associated frustration - sent me down a path of sending requests throughout the US to both learn whether these sorts of problems are systemic (megaspoiler: they are) and to also start mapping communication across the United States. Since then, I’ve submitted over a hundred requests for email metadata across the United States – at least two per state. The first large batch of requests for email metadata were sent to the largest cities of fourteen arbitrary states in a trial run of sorts. In the end of that batch, only two cities were willing to continue with the request - Houston and Seattle. Houston complied surprisingly quickly and snail mailed the metadata for 6m emails. Seattle on the other hand... The Request On April 2, 2017 I sent this fairly boilerplate request to Seattle's IT department: For all emails sent to/from any Seattle owned email address in 2017, please provide the following information: 1. From address 2. To address 3. bcc addresses 4. cc addresses 5. Time 6. Date Technically this request can done with a single line powershell command. At a policy level, though, it usually gets a lot of pushback. Seattle's first response included a bit of gobsmackery that I’ve almost become used to: Based on my preliminary research, there have been 5.5 million emails sent and 26.8 million emails received by seattle.gov email addresses in the past 90 days. This is a significant amount of records that will need to be reviewed prior to sharing them with you. Do you have a more targeted list of email addresses you might be interested in? If not, I will work to find out how long review will take and will be in touch. Of course, I still want the records, so - I would like to stick with this request as-is for all ~32m emails. Since this request is for metadata only, the amount of review needed should be relatively small. Fee Estimate of 33 Million Dollars A week later, I received this glorious response. Each paragraph is interesting in itself, so let’s break most of it down piece by piece. Rewording of request This acknowledges receipt of your public disclosure request C012032-041017 received on April 03, 2017 regarding: All emails sent to/from any Seattle owned email address in 2017 including metadata:1. From address2. To address3. bcc addresses4. cc addresses5. Time6. Date Notice the change of language from the original wording. Their rewording completely changes the scope of the request so that it's not just for metadata, but also the emails contents. No idea why they did that. Salary Fees With that being said, Seattle IT estimates spending 30 seconds to two minutes to review each email. It has been estimated that this work will take approximately 320 years of staff time at an expense of $33 million in salary. Wot. Normally, a flustered public records officer would just reject a giant request for being for “unduly burdensome”… but this sort of estimate is practically unheard of. So much so that other FOIA nerds have told me that this is the second biggest request they've ever seen. The passive aggression is thick. Needless to say, it's not something I'm willing to pay for! The estimate of 30s per metadata entry is also a bit suspect. Especially with the use of Excel, which would be useful for removing duplicates, etc. Storage Fees We estimate that this request contains 8-10 terabytes of information, for which we could need to stand up an FTP server through which the requester will be able to download cleared email meta data. As allowed under RCW 42.56.120, we would charge the requester for the actual copying costs of fulfilling this request. Based on the Seattle IT cost model used for internal City charge backs, the anticipated cost to the requester is $2,480 per year plus $2.11 per gigabyte of storage. We are still working on the storage requirements for this effort. If we assume 10 TB of storage, this would require $21,606.40/year in requester fees. Heh. Any sysadmin can tell you that The costs of storage doesn’t exactly come from the storage medium itself; administration costs, supporting hardware, etc, are the bulk of the costs. But come on, let’s be realistic here. There’s very little room for good faith in their cost estimates – especially since the last time a single gigabyte cost that much was between 2002 and 2004. That said – some other interesting things going on here: Their file size estimation is huge. For comparison, that Houston’s email metadata dump was only 1.2GB. The fact that they mention “meta data” [sic] implies that they did acknowledge that the request was for metadata. Seattle already uses Amazon S3 to store public records requests’ data. At the time, S3 was charging $.023/GB Continue anyway? At this time, the City anticipates that it will be able to provide a first installment of records on or about May 29, 2017. However, please note that this time estimate may change depending on the clarification you provide and as we continue to process your request. If the City does not hear from you within 30 days, the City will consider your request closed. Oddly, they don't actually close out the request and instead ask whether I wanted to continue or not. I responded to their amazing email by asking how many records I'd receive on May 29th, but never received an answer back. Reversal of the Original Cost Estimate On June 5, they sent a new response admitting that their initial fee estimation was wrong, and asked for $1.25 for two days’ (out of three months) of records: At this point I can send you an exel spreadsheet with the data points you are requesting. The cost for the first installment is $1.25 for emails sent or received on January 1 and 2, 2017. Because the spreadsheet does not contain the body of the email and just the metadata that you requested, no review will be necessary, and we'll be able to get this information to you at a faster pace than the 320 years quoted you earlier. Because they're asking for a single check for $1.25 for just two days’ worth of metadata – and wouldn’t send anything until that first check came in, my interpretation is that they’re taking a page from /r/maliciouscompliance and just making this request as painful as possible just for the simple sake of making it difficult. So in response, I preemptively sent them fourteen separate checks. The first thirteen checks were all around ~$1.25. That seemed to work, since they never asked for a single payment afterwards. From there, my inbox went mostly silent for two months, and I mostly forgot about the request, though they eventually cashed all of the checks and made me an account for their public records portal. SNAFU Fast forward to August 22, when I randomly added that email account back to my phone. Unexpectedly, it turned out they actually finished the request! And without a bill for millions of dollars! Sure enough, their public records request portal had about 400 files available to download, which all in all contained metadata for about 32 million emails. Neat! Problem though... they accidentally included the first 256 characters of all 32 million emails. Here are some things I found in the emails: Usernames and passwords. Credit card numbers. Social security numbers and drivers licenses. Ongoing police investigations and arrest reports. Texts of cheating husbands to their lovers. FBI Investigations. Zabbix alerts. In other words... they just leaked to me a massive dataset filled with intimately private information. In the process, they very likely broke many laws, including the Privacy Act of 1974 and many of WA's own public records laws. Frankly, I'm still at a loss of words. It’s hard to say how any of this happened exactly, but odds are that a combination of request’s rewording and the original public records officer going on vacation led to a communication breakdown. I don’t want to dwell on the mistake itself, so I’ll stop it at that. Side note to Seattle's IT department - clean up your disks. You shouldn't have that many disks at 100%! Raising the Issue I responded as passively as possible in the hopes that they’d catch their mistake on their own: The responsive records are not consistent with my request and includes much more info than I initially requested. Could you please revisit this request and provide the records responsive to my initial request? Their response: The information that you requestedis located in columns: From address = column J To address = column K bcc address = column M cc address = column L Time and date = column R of the reports. The records were generatedfrom a system report and I am unable to limit the report to generate only thefields you requested. The City has no duty to create a record that doesnot exist. As such, we have provided all records responsive to yourrequest and consider your request closed. Disregarding the fact that they used a very common tactic of denying information on the basis that its disclosure would require the creation of new records… they didn’t get the point. I explained what information they leaked, and made it very clear how I was going to escalate this: Please address this matter as if it was a large data breach. For now, I will be raising this matter to the WA Office of Privacy and Data Protection. None of the files provided to me have been shared with anyone else, nor do I have any future intention of sharing. Their response: Thank you for your email and bringing this inadvertent error to our attention so quickly. We have temporarily suspended access to GovQA while we look into the cause of this issue. We are also working on reprocessing your request and anticipate providing you with corrected copies of the records you requested through GovQA next week. In the meantime, please do not review, share, copy or otherwise use these records for any purpose. We are sorry for any inconvenience. Phone Call Not too long after that, after contacting some folks on Seattle’s Open Data Slack, I found my way onto a conference phone call with both Seattle’s Chief Technology officer and their Chief Privacy Officer and we discussed what happened, and what should happen with the records. They thanked me for bringing the situation to their attention and all that, but the mood of the call was as if both parties had a knife behind their back. Somewhere towards the end of the call, I asked them if it was okay to keep the emails. Why not at least ask, right? Funny enough, in the middle of that question, my internet died and interrupted the call for the first time in the six months I lived in that house. Odd. It came back ten minutes later, and I dialed back into the conference line, but the mood of the call pretty much 180’d. They told me: All files were to be deleted. Seattle would hire Kroll to scan my hard drives to prove deletion. Agreeing to #1 and #2 would give me full legal indemnification. This isn't something I'm even remotely cool with, so we ended the call a couple minutes later, and agreed to have our lawyers speak going forward. Deleting the Files After that call, I asked my lawyer to reach out to their lawyer and was pretty much told that Seattle was approaching the problem as if they were pursuing Computer Fraud And Abuse (CFAA) charges. For information that they sent. Jiminey Cricket.. So, I deleted the files. Most of what happened next over a month or so was mostly between their lawyer and mine, so there’s not really that much for me to say. Early on I suggested that I write an affidavit that explains what happened, how I deleted the files, and I validated that the files were deleted. They mostly agreed, but still wanted to throw some silly assurance things my way – including asking me to run a bash script to overwrite any unused disk space with random bits. I eventually ran zerofree and fstrim instead, and they accepted the affidavit. No more legal threats from there. Seattle’s Reaction About a week after the phone call, a Seattle city employee contacted Seattle’s KIRO7 about the incident. In KIRO7's investigation, they learned that, Seattle hadn’t sent any disclosure of the leak - something required by WA’s public records request law. Only after their investigation did Seattle actually notify its employees about the emails leak. Link to their story (video inside). A week later, another article was published by Seattle’s Crosscut which goes into a lot of detail, including some history of Seattle's IT department. This line towards the bottom still makes me laugh a little: The buffer against potential legal and administrative chaos in this scenario is only that Chapman has turned out to be, as Armbruster described him, a "good Samaritan." Efforts to track down Chapman were not successful; Crosscut contacted several Matthew Chapmans who denied being the requester. On January 19th, Seattle's CTO, Michael Mattmiller gave his resignation. Whether his resignation is related to the email leak is hard to say, but I just think the timing makes it worth mentioning. Finally – The Metadata Starting January 26th, Seattle started sending installments of the email metadata I requested. So far they've sent 27 million emails. As of the writing of this post, there are only two departments who haven’t provided their email metadata: the Police Department and Human Services. You can download the raw data here. Some things about the dataset: It’s very messy – triple quotes, semicolons, commas, oh my. There are a millions of systems alerts. For seattle.gov → seattle.gov communication, there are two distinct metadata records. In any case, it's still somewhat workable, so I've been working on a proof of concept for its use in the greater context of public records laws. Not ready to talk much about it yet, so here's is a gephi graph of one day's worth of metadata. Its layout is Yifan Hu and filtered with a k-core minimum of 5 and a minimum degree of 5: Please reach out to me if you'd like to help model these networks. One Last Thing: Legislative Immunity Kerfuffle This last section might not be related, but the timing is interesting, so I feel it’s worth mentioning. On February 23 - between the first installment of email metadata and the second - WA’s legislature attempted to pass SB6617, a bill which removes requirements for disclosure of many of their records – including email exchanges - from WA’s public records laws. What’s particularly interesting about this events of this bill is that it took less than 24 hours from the time it was read for the first time to the time that it passed at both the House and Senate and sent to the Governor’s office. Seattle Times wrote a good article about it. Thankfully, after the WA governor’s office received over 6,300 phone calls, 100 letters, and over 12,500 emails, the governor ended up vetoing the bill. Neat. It's hard to say if that caused any sort of delay, but after a month and a half of waiting: How are the installments looking? I saw that there was some recent legislative immunity kerfuffle around emails. Is that related to any delays? And got this response: Good news. The recent Washington state legislative immunity kerfuffle will not impact your installments. We have fixed the bug that was impacting our progress and are now on our way. In fact, I'll have more records for you this week. A month later, they started sending the rest. What’s Next? The work done throughout this post has led to a massive trove of information that ought to be enormously useful in understanding the dynamics of one the US's biggest cities. A big hope in making this sort of information available to the public is that it will help in changing the dynamic of understanding what sorts of information is accessible. That said, this is just one city of many which have given me email metadata. As more of it comes through, I’ll be able to map out more and more, but the difficulty in requesting those records continues to get in the way. Once I get some of these bigger stories out of the way, I’ll start writing fewer stories and write more about public records requesting fundamentals – particularly for digital records. Next post will be about my ongoing suit against the White House OMB for email metadata from January 2017. This past Wednesday was the first court date - where the defendent's counsel never showed up. Hope you enjoyed! Tags: seattle, foia, kerfuffle, metadata
Intro This'll be my first blog post on the internet, ever. Hopefully it's interesting and accurate. Please point out any mistakes if you see any! In 2016, I did some work in trying to find some hotspot areas for parking tickets to see if a bit of data munging could reduce those area's parking tickets. In the end, I only really got one cleaned up, but it was one of the most-ticketed spots in all of Chicago and led to about a 50% reduction in parking tickets. Here's a bit of that story. Getting the data through FOIA: The system Chicago uses to store its parking tickets is called CANVAS. It's short for "City of Chicago Violation, Noticing and Adjudication Business Process And System Support" [sic] and managed by IBM. Its most recent contract started in 2012, expires in 2022, and has a pricetag of over $190 million. Most of Chicago's contracts and their Requests For Procurement (RFP) PDFs are published online. In CANVAS's contracts it gives a fair amount of info on CANVAS's backend infrastructure, including the fact that it uses Oracle 10g. In other words, a FOIA request can be fulfilled by IBM running some simple SQL. CANVAS Technical Spec CANVAS Contract CANVAS Request For Procurement (RFP) and Contract With that information at hand, and a couple failed FOIA requests later, I sent this request to get the parking ticket data from Jan 1, 2009 to Mar 10, 2016: "Please provide to me all possible information on all parking tickets between 2009 and the present day. This should include any information related to the car (make, etc), license plate, ticket, ticketer, ticket reason(s), financial information (paid, etc), court information (contested, etc), situational (eg, time, location), and photos/videos. Ideally, this should also include any relevant ticket-related information stored within CANVAS. [...] This information will be used for data analysis for research. Thanks in advance, Matt Chapman" Data About a month later, a guy named Carl (in a fancy suit), handed me a CD with some extremely messy data in a semicolon delimited file named A50462_TcktsIssdSince2009.txt. The file had info on 17,806,818 parking tickets and spans from Jan 1, 2009 to Mar 10, 2016. The data itself looks like this: head -5 A50462_TcktsIssdSince2009.txt Ticket Number;License Plate Number;License Plate State;License Plate Type;Ticket Make;Issue Date;Violation Location;Violation Code;Violation Description;Badge;Unit;Ticket Queue;Hearing Dispo 39596087;zzzzzz;IL;PAS;VOLV;03/03/2003 11:25 am;3849 W CONGRESS;0976160F;EXPIRED PLATES OR TEMPORARY REGISTRATION;11870;701;Paid; 40228076;zzzzzz;IL;TRK;FORD;03/01/2003 12:29 am;3448 N OKETO;0964170A;TRUCK,RV,BUS, OR TAXI RESIDENTIAL STREET;17488;016;Define; 40480875;zzzzzz;IL;PAS;PONT;03/01/2003 09:45 pm;8135 S PERRY;0964130;PARK OR BLOCK ALLEY;17575;006;Notice; 40718783;zzzzzz;IL;PAS;ISU;03/02/2003 06:02 pm;6928 S CORNELL;0976160F;EXPIRED PLATES OR TEMPORARY REGISTRATION;7296;003;Paid;Liable Some things to note about the data: The file is semicolon delimited. Each address is hand typed on a handheld device, often with gloves. There are millions of typos in the address column. Including over 50,000 semicolons! There is no lat/lon. What that amounts to is an extremely, extremely messy and unpredictable dataset that's incredibly to difficult to accurately map to lat/lon, which is needed for any sort of comprehensive GIS analysis. There are a bunch of geocoder services that can help out here, but most of them have about a 50% accuracy rate. That said, with the help of a bit of scrubbing, that number can be boosted to closer to 90%. Another post for another time. Here’s a sample list of Lake Shore Drive typos: Laks Shore Dr Lawkeshore Dr West Lkaeshore Dr Lkae Shore Dr Lkae Shore Drive Lkaeshore Dr West Original Analysis I was particularly interested in finding areas that had hotspot areas that stood out. A lot of my time was spent just throwing hacky code at the problem and eventually wrote out two series of (hacky) commands that led to identifying a potentially fixable spot. Originally, the work and analysis I was doing was with a combination of unix commands and gnuplot. Since then, I've migrated my code to python + matplotlib + SQL. But, for the sake of this blog, I wanted to show the original analysis. Get count of tickets at addresses and ignoring first two digits: $ mawk -F';' '{print $7}' all_tickets.orig.txt | sed -r 's/^([0-9]*)[0-9][0-9] (.*)/\100 \2/' | sed -r 's/ (BLVD|ST|AV|AVE|RD)$//' | sort | uniq -c | sort -nr 79320 1900 W OGDEN 60059 1100 N STATE 50594 100 N WABASH 44503 1400 N MILWAUKEE 43121 1500 N MILWAUKEE 43030 2800 N BROADWAY 42294 2100 S ARCHER 42116 1900 W HARRISON Get count of tickets, by ticket type, at specific addresses: $ mawk -F';' '{print $9,$7}' A50462_TcktsIssdSince2009.txt | sed -r 's/ (BLVD|ST|AV|AVE|RD)$//' | sort --parallel=4 | uniq -c | sort -nr 12510 EXPIRED PLATES OR TEMPORARY REGISTRATION 5050 W 55TH 9636 PARKING/STANDING PROHIBITED ANYTIME 835 N MICHIGAN 8943 EXPIRED PLATES OR TEMPORARY REGISTRATION 1 W PARKING LOT A 6168 EXPIRED PLATES OR TEMPORARY REGISTRATION 1 W PARKING LOT E 5938 PARKING/STANDING PROHIBITED ANYTIME 500 W MADISON 5663 PARK OR STAND IN BUS/TAXI/CARRIAGE STAND 1166 N STATE 5527 EXPIRED METER OR OVERSTAY 5230 S LAKE PARK 4174 PARKING/STANDING PROHIBITED ANYTIME 1901 W HARRISON 4137 REAR AND FRONT PLATE REQUIRED 1 W PARKING LOT A Both bits of code roughly show that there's something going on at 1100N state street, and 1166 N State St looks particularly suspicious.. So, have a look at the original set of signs: Things going on that make this spot confusing: This is a taxi stand from 7pm to 5am for three cars’ lengths. Parking in a taxi stand is a $100 ticket. When this spot isn’t a taxi stand, it’s metered parking – for a parking meter beyond an alleyway. It’s possible to pay for parking here after 7pm, which makes it look like parking is acceptable – especially with the “ParkChicago” sign floating there. Confusion creates more confusion – if one car parks there, then more cars follow. Cha-ching. Contacting the 2nd Ward With all that in mind, I contacted the second ward’s alderman’s office on April 12 explaining this, and got back this response: "Hello Matt, […]The signs and the ordinance are currently being investigated. In the interim, I highly recommend that you do not park there to avoid any further tickets. Lisa Ryan Alderman Brian Hopkins - 2nd Ward" The Fix On 1/11/17 I received this email from Lisa: "Matt, I don't know if you noticed the additional signage installed on State Street at the 3 Taxi Stand. This should elevate [sic] any further confusion of vehicle parking. Thank you for your patience. Lisa Ryan" Sure enough, two new signs were added! The new taxi stand sign sets a boundary for a previously unbounded taxi stand. The No Parking sign explicitly makes it clear that parking here during taxi stand hours is a fineable offense. Neat! And then there's this guy: Results I recently decided to look at the number of tickets at that spot. Armed with a new set of data from another FOIA request, I did some analysis with python, pandas, and SQL. What I found is that the addition of a new sign effectively led to a 50% reduction in parking tickets between 1150 and 1200 N State St. Adding it all up, that's about 400 tickets fewer in 2017 and 200 so far in 2018 compared to 2016. All in all, that's about $60,000 worth in parking tickets! The drop in slope to about 50% matches perfectly with Lisa’s email: And then comparing 2016 to 2017: What's next? All in all, the number of parking tickets is going up in Chicago, and this work shows that something can be done, even if small. This work is only on one small section of road, but I'm convinced that similar work can be on a systematic scale. It's mostly just a matter of digging through data and working directly with each ward. The later analysis done here was also only done on the most recent dataset that the Department of Revenue they gave me. The two datasets have a different set of columns, so the two datasets need to be combined still. I hope to accomplish that soon! Analysis Code Data used in this blog. Tags: FOIA, parkingtickets, civics, unix, python
More in programming
Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact. Sorbet is a custom static type checker for Ruby that was initially designed and implemented by Stripe engineers on their Product Infrastructure team. Stripe’s Product Infrastructure had similar goals to other companies’ Developer Experience or Developer Productivity teams, but it focused on improving productivity through changes in the internal architecture of the codebase itself, rather than relying solely on external tooling or processes. This strategy explains why Stripe chose to delay decomposition for so long, and how the Product Infrastructure team invested in developer productivity to deal with the challenges of a large Ruby codebase managed by a large software engineering team with low average tenure caused by rapid hiring. Before wrapping this introduction, I want to explicitly acknowledge that this strategy was spearheaded by Stripe’s Product Infrastructure team, not by me. Although I ultimately became responsible for that team, I can’t take credit for this strategy’s thinking. Rather, I was initially skeptical, preferring an incremental migration to an existing strongly-typed programming language, either Java for library coverage or Golang for Stripe’s existing familiarity. Despite my initial doubts, the Sorbet project eventually won me over with its indisputable results. This is an exploratory, draft chapter for a book on engineering strategy that I’m brainstorming in #eng-strategy-book. As such, some of the links go to other draft chapters, both published drafts and very early, unpublished drafts. Reading this document To apply this strategy, start at the top with Policy. To understand the thinking behind this strategy, read sections in reverse order, starting with Explore. More detail on this structure in Making a readable Engineering Strategy document. Policy & Operation The Product Infrastructure team is investing in Stripe’s developer experience by: Every six months, Product Infrastructure will select its three highest priority areas to focus, and invest a significant majority of its energy into those. We will provide minimal support for other areas. We commit to refreshing our priorities every half after running the developer productivity survey. We will further share our results, and priorities, in each Quarterly Business Review. Our three highest priority areas for this half are: Add static typing to the highest value portions of our Ruby codebase, such that we can run the type checker locally and on the test machines to identify errors more quickly. Support selective test execution such that engineers can quickly determine and run the most appropriate tests on their machine rather than delaying until tests run on the build server. Instrument test failures such that we have better data to prioritize future efforts. Static typing is not a typical solution to developer productivity, so it requires some explanation when we say this is our highest priority area for investment. Doubly so when we acknowledge that it will take us 12-24 months of much of the team’s time to get our type checker to an effective place. Our type checker, which we plan to name Sorbet, will allow us to continue developing within our existing Ruby codebase. It will further allow our product engineers to remain focused on developing new functionality rather than migrating existing functionality to new services or programming languages. Instead, our Product Infrastructure team will centrally absorb both the development of the type checker and the initial rollout to our codebase. It’s possible for Product Infrastructure to take on both, despite its fixed size. We’ll rely on a hybrid approach of deep-dives to add typing to particularly complex areas, and scripts to rewrite our code’s Abstract Syntax Trees (AST) for less complex portions. In the relatively unlikely event that this approach fails, the cost to Stripe is of a small, known size: approximately six months of half the Product Infrastructure team, which is what we anticipate requiring to determine if this approach is viable. Based on our knowledge of Facebook’s Hack project, we believe we can build a static type checker that runs locally and significantly faster than our test suite. It’s hard to make a precise guess now, but we think less than 30 seconds to type our entire codebase, despite it being quite large. This will allow for a highly productive local development experience, even if we are not able to speed up local testing. Even if we do speed up local testing, typing would help us eliminate one of the categories of errors that testing has been unable to eliminate, which is passing of unexpected types across code paths which have been tested for expected scenarios but not for entirely unexpected scenarios. Once the type checker has been validated, we can incrementally prioritize adding typing to the highest value places across the codebase. We do not need to wholly type our codebase before we can start getting meaningful value. In support of these static typing efforts, we will advocate for product engineers at Stripe to begin development using the Command Query Responsibility Segregation (CQRS) design pattern, which we believe will provide high-leverage interfaces for incrementally introducing static typing into our codebase. Selective test execution will allow developers to quickly run appropriate tests locally. This will allow engineers to stay in a tight local development loop, speeding up development of high quality code. Given that our codebase is not currently statically typed, inferring which tests to run is rather challenging. With our very high test coverage, and the fact that all tests will still be run before deployment to the production environment, we believe that we can rely on statistically inferring which tests are likely to fail when a given file is modified. Instrumenting test failures is our third, and lowest priority, project for this half. Our focus this half is purely on annotating errors for which we have high conviction about their source, whether infrastructure or test issues. For escalations and issues, reach out in the #product-infra channel. Diagnose In 2017, Stripe is a company of about 1,000 people, including 400 software engineers. We aim to grow our organization by about 70% year-over-year to meet increasing demand for a broader product portfolio and to scale our existing products and infrastructure to accommodate user growth. As our production stability has improved over the past several years, we have now turned our focus towards improving developer productivity. Our current diagnosis of our developer productivity is: We primarily fund developer productivity for our Ruby-authoring software engineers via our Product Infrastructure team. The Ruby-focused portion of that team has about ten engineers on it today, and is unlikely to significantly grow in the future. (If we do expand, we are likely to staff non-Ruby ecosystems like Scala or Golang.) We have two primary mechanisms for understanding our engineer’s developer experience. The first is standard productivity metrics around deploy time, deploy stability, test coverage, test time, test flakiness, and so on. The second is a twice annual developer productivity survey. Looking at our productivity metrics, our test coverage remains extremely high, with coverage above 99% of lines, and tests are quite slow to run locally. They run quickly in our infrastructure because they are multiplexed across a large fleet of test runners. Tests have become slow enough to run locally that an increasing number of developers run an overly narrow subset of tests, or entirely skip running tests until after pushing their changes. They instead rely on our test servers to run against their pull request’s branch, which works well enough, but significantly slows down developer iteration time because the merge, build, and test cycle takes twenty to thirty minutes to complete. By the time their build-test cycle completes, they’ve lost their focus and maybe take several hours to return to addressing the results. There is significant disagreement about whether tests are becoming flakier due to test infrastructure issues, or due to quality issues of the tests themselves. At this point, there is no trustworthy dataset that allows us to attribute between those two causes. Feedback from the twice annual developer productivity survey supports the above diagnosis, and adds some additional nuance. Most concerning, although long-tenured Stripe engineers find themselves highly productive in our codebase, we increasingly hear in the survey that newly hired engineers with long tenures at other companies find themselves unproductive in our codebase. Specifically, they find it very difficult to determine how to safely make changes in our codebase. Our product codebase is entirely implemented in a single Ruby monolith. There is one narrow exception, a Golang service handling payment tokenization, which we consider out of scope for two reasons. First, it is kept intentionally narrow in order to absorb our SOC1 compliance obligations. Second, developers in that environment have not raised concerns about their productivity. Our data infrastructure is implemented in Scala. While these developers have concerns–primarily slow build times–they manage their build and deployment infrastructure independently, and the group remains relatively small. Ruby is not a highly performant programming language, but we’ve found it sufficiently efficient for our needs. Similarly, other languages are more cost-efficient from a compute resources perspective, but a significant majority of our spend is on real-time storage and batch computation. For these reasons alone, we would not consider replacing Ruby as our core programming language. Our Product Infrastructure team is about ten engineers, supporting about 250 product engineers. We anticipate this group growing modestly over time, but certainly sublinearly to the overall growth of product engineers. Developers working in Golang and Scala routinely ask for more centralized support, but it’s challenging to prioritize those requests as we’re forced to consider the return on improving the experience for 240 product engineers working in Ruby vs 10 in Golang or 40 data engineers in Scala. If we introduced more programming languages, this prioritization problem would become increasingly difficult, and we are already failing to support additional languages.
The new AMD HX370 option in the Framework 13 is a good step forward in performance for developers. It runs our HEY test suite in 2m7s, compared to 2m43s for the 7840U (and 2m49s for a M4 Pro!). It's also about 20% faster in most single-core tasks than the 7840U. But is that enough to warrant the jump in price? AMD's latest, best chips have suddenly gotten pretty expensive. The F13 w/ HX370 now costs $1,992 with 32GB RAM / 1TB. Almost the same an M4 Pro MBP14 w/ 24GB / 1TB ($2,199). I'd pick the Framework any day for its better keyboard, 3:2 matte screen, repairability, and superb Linux compatibility, but it won't be because the top option is "cheaper" any more. Of course you could also just go with the budget 6-core Ryzen AI 5 340 in same spec for $1,362. I'm sure that's a great machine too. But maybe the sweet spot is actually the Ryzen AI 7 350. It "only" has 8 cores (vs 12 on the 370), but four of those are performance cores -- the same as the 370. And it's $300 cheaper. So ~$1,600 gets you out the door. I haven't actually tried the 350, though, so that's just speculation. I've been running the 370 for the last few months. Whichever chip you choose, the rest of the Framework 13 package is as good as it ever was. This remains my favorite laptop of at least the last decade. I've been running one for over a year now, and combined with Omakub + Neovim, it's the first machine in forever where I've actually enjoyed programming on a 13" screen. The 3:2 aspect ratio combined with Linux's superb multiple desktops that switch with 0ms lag and no animations means I barely miss the trusted 6K Apple XDR screen when working away from the desk. The HX370 gives me about 6 hours of battery life in mixed use. About the same as the old 7840U. Though if all I'm doing is writing, I can squeeze that to 8-10 hours. That's good enough for me, but not as good as a Qualcomm machine or an Apple M-chip machine. For some people, those extra hours really make the difference. What does make a difference, of course, is Linux. I've written repeatedly about how much of a joy it's been to rediscover Linux on the desktop, and it's a joy that keeps on giving. For web work, it's so good. And for any work that requires even a minimum of Docker, it's so fast (as the HEY suite run time attests). Apple still has a strong hardware game, but their software story is falling apart. I haven't heard many people sing the praises of new iOS or macOS releases in a long while. It seems like without an asshole in charge, both have move towards more bloat, more ads, more gimmicks, more control. Linux is an incredible antidote to this nonsense these days. It's also just fun! Seeing AMD catch up in outright performance if not efficiency has been a delight. Watching Framework perfect their 13" laptop while remaining 100% backwards compatible in terms of upgrades with the first versions is heartwarming. And getting to test the new Framework Desktop in advance of its Q3 release has only affirmed my commitment to both. But on the new HX370, it's in my opinion the best Linux laptop you can buy today, which by extension makes it the best web developer laptop too. The top spec might have gotten a bit pricey, but there are options all along the budget spectrum, which retains all the key ingredients any way. Hard to go wrong. Forza Framework!
I’m a big fan of keyring, a Python module made by Jason R. Coombs for storing secrets in the system keyring. It works on multiple operating systems, and it knows what password store to use for each of them. For example, if you’re using macOS it puts secrets in the Keychain, but if you’re on Windows it uses Credential Locker. The keyring module is a safe and portable way to store passwords, more secure than using a plaintext config file or an environment variable. The same code will work on different platforms, because keyring handles the hard work of choosing which password store to use. It has a straightforward API: the keyring.set_password and keyring.get_password functions will handle a lot of use cases. >>> import keyring >>> keyring.set_password("xkcd", "alexwlchan", "correct-horse-battery-staple") >>> keyring.get_password("xkcd", "alexwlchan") "correct-horse-battery-staple" Although this API is simple, it’s not perfect – I have some frustrations with the get_password function. In a lot of my projects, I’m now using a small function that wraps get_password. What do I find frustrating about keyring.get_password? If you look up a password that isn’t in the system keyring, get_password returns None rather than throwing an exception: >>> print(keyring.get_password("xkcd", "the_invisible_man")) None I can see why this makes sense for the library overall – a non-existent password is very normal, and not exceptional behaviour – but in my projects, None is rarely a usable value. I normally use keyring to retrieve secrets that I need to access protected resources – for example, an API key to call an API that requires authentication. If I can’t get the right secrets, I know I can’t continue. Indeed, continuing often leads to more confusing errors when some other function unexpectedly gets None, rather than a string. For a while, I wrapped get_password in a function that would throw an exception if it couldn’t find the password: def get_required_password(service_name: str, username: str) -> str: """ Get password from the specified service. If a matching password is not found in the system keyring, this function will throw an exception. """ password = keyring.get_password(service_name, username) if password is None: raise RuntimeError(f"Could not retrieve password {(service_name, username)}") return password When I use this function, my code will fail as soon as it fails to retrieve a password, rather than when it tries to use None as the password. This worked well enough for my personal projects, but it wasn’t a great fit for shared projects. I could make sense of the error, but not everyone could do the same. What’s that password meant to be? A good error message explains what’s gone wrong, and gives the reader clear steps for fixing the issue. The error message above is only doing half the job. It tells you what’s gone wrong (it couldn’t get the password) but it doesn’t tell you how to fix it. As I started using this snippet in codebases that I work on with other developers, I got questions when other people hit this error. They could guess that they needed to set a password, but the error message doesn’t explain how, or what password they should be setting. For example, is this a secret they should pick themselves? Is it a password in our shared password vault? Or do they need an API key for a third-party service? If so, where do they find it? I still think my initial error was an improvement over letting None be used in the rest of the codebase, but I realised I could go further. This is my extended wrapper: def get_required_password(service_name: str, username: str, explanation: str) -> str: """ Get password from the specified service. If a matching password is not found in the system keyring, this function will throw an exception and explain to the user how to set the required password. """ password = keyring.get_password(service_name, username) if password is None: raise RuntimeError( "Unable to retrieve required password from the system keyring!\n" "\n" "You need to:\n" "\n" f"1/ Get the password. Here's how: {explanation}\n" "\n" "2/ Save the new password in the system keyring:\n" "\n" f" keyring set {service_name} {username}\n" ) return password The explanation argument allows me to explain what the password is for to a future reader, and what value it should have. That information can often be found in a code comment or in documentation, but putting it in an error message makes it more visible. Here’s one example: get_required_password( "flask_app", "secret_key", explanation=( "Pick a random value, e.g. with\n" "\n" " python3 -c 'import secrets; print(secrets.token_hex())'\n" "\n" "This password is used to securely sign the Flask session cookie. " "See https://flask.palletsprojects.com/en/stable/config/#SECRET_KEY" ), ) If you call this function and there’s no keyring entry for flask_app/secret_key, you get the following error: Unable to retrieve required password from the system keyring! You need to: 1/ Get the password. Here's how: Pick a random value, e.g. with python3 -c 'import secrets; print(secrets.token_hex())' This password is used to securely sign the Flask session cookie. See https://flask.palletsprojects.com/en/stable/config/#SECRET_KEY 2/ Save the new password in the system keyring: keyring set flask_app secret_key It’s longer, but this error message is far more informative. It tells you what’s wrong, how to save a password, and what the password should be. This is based on a real example where the previous error message led to a misunderstanding. A co-worker saw a missing password called “secret key” and thought it referred to a secret key for calling an API, and didn’t realise it was actually for signing Flask session cookies. Now I can write a more informative error message, I can prevent that misunderstanding happening again. (We also renamed the secret, for additional clarity.) It takes time to write this explanation, which will only ever be seen by a handful of people, but I think it’s important. If somebody sees it at all, it’ll be when they’re setting up the project for the first time. I want that setup process to be smooth and straightforward. I don’t use this wrapper in all my code, particularly small or throwaway toys that won’t last long enough for this to be an issue. But in larger codebases that will be used by other developers, and which I expect to last a long time, I use it extensively. Writing a good explanation now can avoid frustration later. [If the formatting of this post looks odd in your feed reader, visit the original article]
Short one this time because I have a lot going on this week. In computation complexity, NP is the class of all decision problems (yes/no) where a potential proof (or "witness") for "yes" can be verified in polynomial time. For example, "does this set of numbers have a subset that sums to zero" is in NP. If the answer is "yes", you can prove it by presenting a set of numbers. We would then verify the witness by 1) checking that all the numbers are present in the set (~linear time) and 2) adding up all the numbers (also linear). NP-complete is the class of "hardest possible" NP problems. Subset sum is NP-complete. NP-hard is the set all problems at least as hard as NP-complete. Notably, NP-hard is not a subset of NP, as it contains problems that are harder than NP-complete. A natural question to ask is "like what?" And the canonical example of "NP-harder" is the halting problem (HALT): does program P halt on input C? As the argument goes, it's undecidable, so obviously not in NP. I think this is a bad example for two reasons: All NP requires is that witnesses for "yes" can be verified in polynomial time. It does not require anything for the "no" case! And even though HP is undecidable, there is a decidable way to verify a "yes": let the witness be "it halts in N steps", then run the program for that many steps and see if it halted by then. To prove HALT is not in NP, you have to show that this verification process grows faster than polynomially. It does (as busy beaver is uncomputable), but this all makes the example needlessly confusing.1 "What's bigger than a dog? THE MOON" Really (2) bothers me a lot more than (1) because it's just so inelegant. It suggests that NP-complete is the upper bound of "solvable" problems, and after that you're in full-on undecidability. I'd rather show intuitive problems that are harder than NP but not that much harder. But in looking for a "slightly harder" problem, I ran into an, ah, problem. It seems like the next-hardest class would be EXPTIME, except we don't know for sure that NP != EXPTIME. We know for sure that NP != NEXPTIME, but NEXPTIME doesn't have any intuitive, easily explainable problems. Most "definitely harder than NP" problems require a nontrivial background in theoretical computer science or mathematics to understand. There is one problem, though, that I find easily explainable. Place a token at the bottom left corner of a grid that extends infinitely up and right, call that point (0, 0). You're given list of valid displacement moves for the token, like (+1, +0), (-20, +13), (-5, -6), etc, and a target point like (700, 1). You may make any sequence of moves in any order, as long as no move ever puts the token off the grid. Does any sequence of moves bring you to the target? This is PSPACE-complete, I think, which still isn't proven to be harder than NP-complete (though it's widely believed). But what if you increase the number of dimensions of the grid? Past a certain number of dimensions the problem jumps to being EXPSPACE-complete, and then TOWER-complete (grows tetrationally), and then it keeps going. Some point might recognize this as looking a lot like the Ackermann function, and in fact this problem is ACKERMANN-complete on the number of available dimensions. A friend wrote a Quanta article about the whole mess, you should read it. This problem is ludicrously bigger than NP ("Chicago" instead of "The Moon"), but at least it's clearly decidable, easily explainable, and definitely not in NP. It's less confusing if you're taught the alternate (and original!) definition of NP, "the class of problems solvable in polynomial time by a nondeterministic Turing machine". Then HALT can't be in NP because otherwise runtime would be bounded by an exponential function. ↩
I’ve been writing some internal dashboards recently, and one hard part is displaying timestamps. Our server does everything in UTC, but the team is split across four different timezones, so the server timestamps aren’t always easy to read. For most people, it’s harder to understand a UTC timestamp than a timestamp in your local timezone. Did that event happen just now, an hour ago, or much further back? Was that at the beginning of your working day? Or at the end? Then I remembered that I tried to solve this five years ago at a previous job. I wrote a JavaScript snippet that converts UTC timestamps into human-friendly text. It displays times in your local time zone, and adds a short suffix if the time happened recently. For example: today @ 12:00 BST (1 hour ago) In my old project, I was using writing timestamps in a <div> and I had to opt into the human-readable text for every date on the page. It worked, but it was a bit fiddly. Doing it again, I thought of a more elegant solution. HTML has a <time> element for expressing datetimes, which is a more meaningful wrapper than a <div>. When I render the dashboard on the server, I don’t know the user’s timezone, so I include the UTC timestamp in the page like so: <time datetime="2025-04-15 19:45:00Z"> Tue, 15 Apr 2025 at 19:45 UTC </time> I put a machine-readable date and time string with a timezone offset string in the datetime attribute, and then a more human-readable string in the text of the element. Then I add this JavaScript snippet to the page: window.addEventListener("DOMContentLoaded", function() { document.querySelectorAll("time").forEach(function(timeElem) { // Set the `title` attribute to the original text, so a user // can hover over a timestamp to see the UTC time. timeElem.setAttribute("title", timeElem.innerText); // Replace the display text with a human-friendly date string // which is localised to the user's timezone. timeElem.innerText = getHumanFriendlyDateString( timeElem.getAttribute("datetime") ); }) }); This updates any <time> element on the page to use a human friendly date string, which is localised to the user’s timezone. For example, I’m in the UK so that becomes: <time datetime="2025-04-15 19:45:00Z" title="Tue, 15 Apr 2025 at 19:45 UTC"> Tue, 15 Apr 2025 at 20:45 BST </time> In my experience, these timestamps are easier and more intuitive for people to read. I always include a timezone string (e.g. BST, EST, PDT) so it’s obvious that I’m showing a localised timestamp. If you really need the UTC timestamp, it’s in the title attribute, so you can see it by hovering over it. (Sorry, mouseless users, but I don’t think any of my team are browsing our dashboards from their phone or tablet.) If the JavaScript doesn’t load, you see the plain old UTC timestamp. It’s not ideal, but the page still loads and you can see all the information – this behaviour is an enhancement, not an essential. To me, this is the unfulfilled promise of the <time> element. In my fantasy world, web page authors would write the time in a machine-readable format, and browsers would show it in a way that makes sense for the reader. They’d take into account their language, locale, and time zone. I understand why that hasn’t happened – it’s much easier said than done. You need so much context to know what’s the “right” thing to do when dealing with datetimes, and guessing without that context is at the heart of many datetime bugs. These sort of human-friendly, localised timestamps are very handy sometimes, and a complete mess at other times. In my staff-only dashboards, I have that context. I know what these timestamps mean, who’s going to be reading them, and I think they’re a helpful addition that makes the data easier to read. [If the formatting of this post looks odd in your feed reader, visit the original article]