More from Civic Hax
Intro

Hi there! In this post, I want to show off a fun little web app I made for visualizing parking tickets in Chicago, but because I've spent so much time on the overall project, I figured I'd share the story that got me to this point. In many ways, this work is the foundation for my interest in public records and transparency, so it has a very special place in my heart. If you're here just to play around with a cool app, click here. Otherwise, I hope you enjoy this post and find it interesting. Also - please don't hesitate to share harsh criticism or suggestions. Enjoy!

Project Beginnings

Back in 2014, while on vacation, my car was towed for allegedly being in a construction zone. The cost to get the car out was quoted at around $700 through tow fees and storage fees that increase each day by $35. That's obviously a lot of money, so the night I noticed the car was towed, I began looking for ways to get the ticket thrown out in court. After some digging, I found a dataset on data.cityofchicago.org which details every street closure, including closures for construction. As luck would have it, the address my car was towed at had zero open construction permits, which seemed like a good reason to throw out the fines. The next day I confirmed the lack of permits by calling Chicago's permit office - they mentioned the only thing they found was a canceled construction permit from ComEd. That same day, I scheduled a court date for the following day.

When I arrived in the court room, I was immediately pulled aside by a city employee and was told that my case was being thrown out. They explained that the ticket itself "didn't have enough information", but besides that I wasn't told much! The judge then signed a form that allowed the release of my car at zero cost.

At this point, I'd imagine most folks would just walk away with some frustration, but would be happy overall and move on. For whatever reason, though, I'm too stubborn for that - I couldn't get over the fact that Chicago didn't just spend five minutes checking the ticket before I appeared in court. And what bothered me more was that there isn't some sort of automated check. Surely Chicago has some process in place within its $200 million ticketing system to make sure it's not giving out invalid tickets... Spoiler: it doesn't. The more I thought about it, the more I started wondering why Chicago's court system only favors those who have the time and resources to go to court, and whether it was possible to automate these checks myself using historical data. The hunch came from what I'd seen of the recent "Open Gov" movement to make data open to the public by default.

When I first started, I mostly worked with some already publicly available datasets previously released by several news organizations. WBEZ in particular released a fairly large dataset of towing data that, while compelling, ultimately wasn't useful for my little project. So instead, I started working on a similar goal: to programmatically find invalid tickets in Chicago. That work and its research led to my very first FOIA request, where, after some trial and error, I received the records for 14m parking tickets.

Trial Development and Trial Geocoding

With the data at hand, I started throwing many, many Unix one-liners and gnuplot at the problem. This method worked well enough for small one-off checks, but for anything else, it's a complete mess of sed and awk statements.
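To make the idea of an automated check concrete, here's a rough sketch of what that permit lookup might look like against a CSV export of the street-closure dataset - shown in Python/pandas rather than the shell one-liners above, purely for readability. The file name, column names, street, and date are all invented for illustration; the real dataset's schema may differ.

    import pandas as pd

    # Hypothetical export of the street-closure / construction permit dataset.
    permits = pd.read_csv("street_closures.csv", parse_dates=["start_date", "end_date"])

    def open_permits_for_block(street_name, tow_date):
        """Return permits covering the street that were active on the tow date."""
        return permits[
            permits["street_name"].str.contains(street_name, case=False, na=False)
            & (permits["start_date"] <= tow_date)
            & (permits["end_date"] >= tow_date)
            & (permits["status"].str.upper() != "CANCELLED")
        ]

    # Made-up street and date, just to show the shape of the check.
    if open_permits_for_block("N WOOD ST", pd.Timestamp("2014-07-15")).empty:
        print("No open construction permits - grounds to contest the tow.")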
After the Unix-by-default approach, I moved on to Python, where I made a small batch of surprisingly powerful, but scraggly, scripts. One script, for example, would check whether a residential parking permit ticket was valid by comparing a ticket's address against a set of streets and their ranges found on data.cityofchicago.org. The script ended up finding a lot of residential permit tickets given outside of the sign's listed boundaries. Sadly though, none of the findings were relevant, since a ticket is given based on the physical location of the sign, not what the sign's location data represents. These sorts of problems ended up putting a halt to these sorts of scripts. With the next development iteration, I switched to visualizing parking tickets, which is essentially where I've continued until today. Problem was, the original data had no lat/lng and had to be geocoded.

So, back when I first started doing this, there weren't that many geocoding services that were reasonably priced. Google was certainly possible, but its licensing and usage limits made Google a total no-go. The other two options I found were from Equifax (expensive) and the US Post Office (also expensive). So I went down a several-hundred-hour rabbit hole of attempting to geocode all addresses myself - entirely under the philosophy that all tickets had a story to tell, and anything lower than 90% geocoding was inexcusable.

The first attempt at geocoding was decent, but only matched about 30% of tickets. It worked by tokenizing addresses using a wonderful Python library called usaddress, then comparing the "string distance" of a street address against a known set of addresses already paired with lat/long. It got me to a point where I could make my first map visualization using some simple plotting: Jupyter Notebook

Still, with only 30% geocoded, I felt that I could do better through many different methods that continued to use string distance - some implementations much fancier than others. Other methods, ahem, used lots of sed and vim, which I'm still not proud of. There were some attempts where I got close to 90% of addresses geocoded, though I ended up throwing each one out in fear that the lack of validation would bite me in the future. Spoiler: it did. Eventually, my life was made much easier thanks to the folks at SmartyStreets, who set me up with an account with unlimited geocoding requests. By itself, SmartyStreets was able to geocode around 50% of the addresses, but when combined with some of my string-distance autocorrection, that number shot up to 70%. It wasn't perfect, but it was "good enough" to move on, and it didn't require me to trust my own set of corrections and the anxiety that comes with that.

Another New Dataset

Fast forward about six months - I hadn't gotten much further on visualization, but instead started focusing on my first blog post. Around the time of that post, ProPublica published some excellent work on parking tickets and released its own ticket datasets to the public, one of which was geocoded. I'm happy to say I had a small part in helping out with that, and I then started using it for my own work in the hopes of using a common dataset between groups.

More Geocoding Fun

A high-level description of ProPublica's geocoding process is available in pretty good detail here, so if you're interested in how the geocoding was done, you should check that out. In brief, the dataset's addresses were geocoded using Geocodio. The geocoding process they used was dramatically simplified by replacing each address's last two digits with zeroes.
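A minimal sketch of what that block-level simplification looks like - my own illustration, not ProPublica's or Geocodio's code - is below: the house number is rounded down to its block by zeroing the last two digits before geocoding. The example address is made up.

    import re

    def simplify_to_block(address):
        """Round a street address down to its block, e.g. '6559 N Artesian Ave' -> '6500 N Artesian Ave'."""
        def round_down(match):
            number = match.group(0)
            # Leave one- and two-digit house numbers alone; zero the rest.
            return number if len(number) <= 2 else number[:-2] + "00"
        return re.sub(r"^\d+", round_down, address.strip())

    print(simplify_to_block("6559 N Artesian Ave"))  # prints: 6500 N Artesian Ave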
It worked surprisingly well, and brought the percentage of geocoded tickets to 99%, but this method had the unfortunate side effect of reducing mapping accuracy. So, because I have more opinions than I care to admit about geocoding, I had to look into how well the geocoding was done.

Geocoding Strangeness

After a few checks against the geocoded results, I noticed that a large portion of the addresses weren't geocoded correctly - often in strange ways. One example comes from the tendency of geocoders to favor one direction over the other for some streets. As can be seen below, the geocoding done on Michigan and Wells is completely demolished. What's particularly notable here is that Michigan Ave has zero instances of "S Michigan" swapped to "N Michigan".

                                        N Wells     S Wells     Total
    Total Tickets (pre-geocoding)       280,160     127,064     407,224
    Unique Addresses (pre-geocoding)    3,254       3,190       6,444
    Total Tickets (post-geocoding)      342,020     37,611      379,631
    Unique Addresses (post-geocoding)   40          57          97
    Direction is swapped                246         54,309      54,555

                                        N Michigan  S Michigan  Total
    Total Tickets (pre-geocoding)       102,225     392,391     494,616
    Unique Addresses (pre-geocoding)    2,916       16,585      19,501
    Total Tickets (post-geocoding)      15,760      509,485     512,401
    Unique Addresses (post-geocoding)   136         144         280
    Direction is swapped                102,225     0           102,225

Interestingly, the total ticket count for Wells drops by about 30k after geocoding. This turned out to be from Geocodio strangely renaming '200 S Wells' to '200 W Hill St'. Something similar also happens with numbered streets like "83rd St", where the street name is often renamed to "100th". Because of this, 100th Street appears to have six times as many tickets as it actually does. These are just a few examples, but it's probably safe to say they're the tip of the iceberg. My gut tells me that the address simplification had a major effect here. I think a simple solution of replacing the last two digits with 01 if odd and 02 if even - instead of 0s - would do the trick.

Quick Geocoding Story

A few years ago, I approached the (ex) Chicago Chief of Open Data and asked if he, or someone in Chicago, was interested in a copy of the geocoded addresses I'd worked on, since they don't actually have anything geocoded (heh). He politely declined, and mentioned that since Chicago has its own geocoder, my geocoded addresses weren't useful. I then asked if Chicago could run the parking ticket addresses through their geocoder - something that would then be requestable through FOIA. His response was, "No, because that would require a for loop". Hmmmmm.

Visualization to Validate Geocoding

In one of my stranger attempts to validate the geocoding results, I wanted to see whether it was possible to use the distance a ticketer traveled to check whether a particular geocode was botched. It ended up being less useful than I'd hoped, but out of it, I made some pretty interesting looking visualizations: Left: Ticketer paths every 15 minutes with 24 hours decay (Animated). Right: All ticketer paths for a full month. This validation code, while it wasn't even remotely useful, ended up being the foundation for what I'm calling my final personal attempt to visualize Chicago's parking tickets.

Final Version

And with that, here is the final version of the parking ticket visualization app. Click the image below to navigate to the app.

Bokeh Frontend

The frontend for the app uses a Python framework named Bokeh.
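For readers who haven't used Bokeh, here's a toy Bokeh server sketch just to show the flavor of the framework. The data and file name are made up, and this is not the app's actual code (that's linked a bit later in the post).

    # app.py - a toy Bokeh server app; run with: bokeh serve --show app.py
    from bokeh.io import curdoc
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure

    # Fake data: ticket counts per hour of day.
    source = ColumnDataSource(data=dict(hour=list(range(24)), tickets=[0] * 24))

    p = figure(title="Tickets by hour", x_axis_label="Hour", y_axis_label="Ticket count")
    p.vbar(x="hour", top="tickets", source=source, width=0.8)

    curdoc().add_root(p)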
It's an extremely powerful interactive visualization framework that dramatically reduces how much code it takes to make both simple and complex interfaces. It's not without its (many) faults, but in Python land, I personally consider it to be the best available. That said, I wish the Bokeh devs would invest a lot more time improving the server-side documentation. There's just way too much guesswork, buggy behavior, and special one-offs required to match Bokeh's non-server functionality.

Flask/Pandas Backend

The original backend was PostgreSQL along with PostGIS (for GeoJSON creation) and TimescaleDB (for timeseries charting). Sadly, PostgreSQL started struggling with larger queries, combined with memory exhausting itself quicker than I'd hoped. In the end, I wrote a Flask service that runs in PyPy with Pandas as an in-memory data store. This ended up working surprisingly well in both speed and modularity (a stripped-down sketch of this pattern appears a little further down). That said, there's still a lot more room for speed improvement - especially in caching! If I ever revisit the backend, I'll try giving PostgreSQL another chance, but throw a bunch more optimization its way. Please feel free to check out the application code.

Interesting Findings

To wrap this post up, I wanted to share some of the things I've found while playing around with the app. Everything below comes from screenshots of the app, so pardon any loss of information.

Snow Route Tickets Outside 3AM-7AM

If you park your car in a "snow route" between 3AM and 7AM, you will get a ticket. Interestingly, between the hours of 8am and 11pm, over 1,000 tickets have been given over the past 15 years - mostly by CPD. Of these, only 6% have been contested - and in 6 of those cases the car owner was somehow still found liable. The other 94% should have automatically been thrown out, but Chicago has no systematic way of doing this. Shame. Highest value line = CPD's tickets.

Expired Meter Within Central Business District - Outside of CBD

Downtown Chicago has an area that's officially described as the "Central Business District". Within this area, tickets for expired meters cost $65 instead of the normal $50. In the past 13 years, over 52k of these tickets have been given outside of the Central Business District. Of those 52k, only 5k tickets were taken to court, and 70% were thrown out. Overall, Chicago made around $725k off the $15 difference. Another instance where Chicago could have systematically thrown out many tickets, but didn't. Downtown removed to highlight non-central business district areas.

City sticker expiration tickets.. in the middle of July?

Up until 2015, Chicago's city stickers expired at the end of June. A massive spike of tickets is bound to happen, but what's strange is that the spike happens in the middle of July, not the start of it. At the onset of each spike, Chicago is making somewhere between $500k and $1m. This is a $200 ticket that has a nonstop effect on Chicago's poorer citizens. For more on this, check out this great ProPublica article. (The two different colors here come from the fact that Chicago changed the ticket description sometime in 2012.)

Bike Lane Tickets

With the increasing prevalence of bike lanes and the danger from cars parked in them, I was expecting to see a lot of tickets for parking in a bike lane. Unfortunately, that's not the case. For example, in 2017, only around 3,000 tickets for parking in a bike lane were given. For more on this, click here.
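Picking up the backend thread from above, here's a stripped-down sketch of the Flask-plus-Pandas pattern: a DataFrame loaded once at startup and queried per request. The route, file name, and column names are invented for illustration; this is not the app's actual backend.

    from flask import Flask, jsonify, request
    import pandas as pd

    app = Flask(__name__)

    # Load the ticket data once at startup and keep it in memory.
    tickets = pd.read_csv("tickets.csv", parse_dates=["issue_date"])

    @app.route("/tickets/by_violation")
    def tickets_by_violation():
        """Return ticket counts per violation code, optionally filtered by ward."""
        frame = tickets
        ward = request.args.get("ward")
        if ward is not None:
            frame = frame[frame["ward"] == int(ward)]
        counts = frame.groupby("violation_code").size().sort_values(ascending=False)
        return jsonify({str(code): int(n) for code, n in counts.head(25).items()})

    if __name__ == "__main__":
        app.run()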
Wards Have Different Street Cleaning Times

I found some oddities while searching for street cleaning tickets given at odd hours. Wards 48, 49 and 50 seem to give out parking tickets much later than the rest of Chicago. Then, starting at 6am, Ward 40 starts giving out tickets earlier than the rest of Chicago. Posted signs probably make it relatively clear that there's going to be street cleaning, but it's also probably fair to say that this is understandably confusing. Hopefully there's a good reason these wards decided to be special.

What's Next?

That's hard to say. There's a ton of work that needs to be done to address the systemic problems with Chicago's parking, but there just aren't enough people working on the problem. If you're interested in helping out, you should definitely download the data and start playing around with it! Also, if you're in Chicago, you should check out ChiHackNight on Tuesday nights. Just make sure to poke some of us parking ticket nerds beforehand in #tickets on ChiHackNight's Slack. That said, a lot of these issues should really be looked at by Chicago in more depth. The system used to manage parking ticket information - CANVAS - has cost Chicago around $200,000,000, and yet it's very clear that nobody from Chicago is interested in diving into the data. In fact, based on an old FOIA request, Chicago has only done a single spreadsheet's worth of analysis with its parking ticket data. I hope that some day Chicago starts automating some ticket validation workflows, but it will probably take a lot of public advocacy to compel them to do so.

Unrelated Notes

Recently I filed suit against Chicago after a FOIA request for the column and table names of CANVAS was rejected with a network security exemption. If successful, I hope that the information can be used to submit future FOIA requests using the database schema as the foundation. I'm extremely happy with the argument I made, and I plan on writing about it as the case comes to a close. Anywho, if you like this post and want to see more like it, please consider donating to my non-profit, Free Our Info, NFP, which recently became a 501(c)(3). Much of the FOIA work from this blog (much of which is unwritten) is being filtered over to the NFP, so donations will allow similar work to continue. A majority of the proceeds will go directly into paying for FOIA litigation. So, the more donations I receive, the more I can hold government agencies accountable! Hope you enjoyed.

Tags: tickets, foia, chicago, bokeh, visualization
Background

In my last post, I wrote about my adventure of requesting metadata for both phone calls and emails from the City of Chicago Office of the Mayor. The work there - and its associated frustration - sent me down a path of sending requests throughout the US, both to learn whether these sorts of problems are systemic (megaspoiler: they are) and to start mapping communication across the United States. Since then, I've submitted over a hundred requests for email metadata across the United States - at least two per state. The first large batch of requests for email metadata was sent to the largest cities of fourteen arbitrary states in a trial run of sorts. At the end of that batch, only two cities were willing to continue with the request - Houston and Seattle. Houston complied surprisingly quickly and snail-mailed the metadata for 6m emails. Seattle, on the other hand...

The Request

On April 2, 2017 I sent this fairly boilerplate request to Seattle's IT department: For all emails sent to/from any Seattle owned email address in 2017, please provide the following information: 1. From address 2. To address 3. bcc addresses 4. cc addresses 5. Time 6. Date

Technically this request can be done with a single-line PowerShell command. At a policy level, though, it usually gets a lot of pushback. Seattle's first response included a bit of gobsmackery that I've almost become used to: Based on my preliminary research, there have been 5.5 million emails sent and 26.8 million emails received by seattle.gov email addresses in the past 90 days. This is a significant amount of records that will need to be reviewed prior to sharing them with you. Do you have a more targeted list of email addresses you might be interested in? If not, I will work to find out how long review will take and will be in touch.

Of course, I still want the records, so - I would like to stick with this request as-is for all ~32m emails. Since this request is for metadata only, the amount of review needed should be relatively small.

Fee Estimate of 33 Million Dollars

A week later, I received this glorious response. Each paragraph is interesting in itself, so let's break most of it down piece by piece.

Rewording of request

This acknowledges receipt of your public disclosure request C012032-041017 received on April 03, 2017 regarding: All emails sent to/from any Seattle owned email address in 2017 including metadata: 1. From address 2. To address 3. bcc addresses 4. cc addresses 5. Time 6. Date

Notice the change of language from the original wording. Their rewording completely changes the scope of the request so that it's not just for metadata, but also the emails' contents. No idea why they did that.

Salary Fees

With that being said, Seattle IT estimates spending 30 seconds to two minutes to review each email. It has been estimated that this work will take approximately 320 years of staff time at an expense of $33 million in salary.

Wot. Normally, a flustered public records officer would just reject a giant request for being "unduly burdensome"... but this sort of estimate is practically unheard of. So much so that other FOIA nerds have told me that this is the second biggest request they've ever seen. The passive aggression is thick. Needless to say, it's not something I'm willing to pay for! The estimate of 30s per metadata entry is also a bit suspect - especially with the use of Excel, which would be useful for removing duplicates, etc.
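For what it's worth, their headline numbers are roughly reproducible with a quick back-of-the-envelope check - assuming an average of about 75 seconds per email (the midpoint of their 30-second-to-2-minute range) and a standard 2,080-hour work year, both of which are my assumptions, not theirs:

    # Back-of-the-envelope check of Seattle's estimate.
    emails = 5_500_000 + 26_800_000            # sent + received, per their numbers
    review_hours = emails * 75 / 3600          # 75s per email is my assumed midpoint
    staff_years = review_hours / 2080          # standard full-time work year
    print(round(staff_years, 1))               # ~323.5 staff-years, close to their "320 years"
    print(round(33_000_000 / staff_years))     # ~$102,000 implied salary per staff-year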
Storage Fees

We estimate that this request contains 8-10 terabytes of information, for which we could need to stand up an FTP server through which the requester will be able to download cleared email meta data. As allowed under RCW 42.56.120, we would charge the requester for the actual copying costs of fulfilling this request. Based on the Seattle IT cost model used for internal City charge backs, the anticipated cost to the requester is $2,480 per year plus $2.11 per gigabyte of storage. We are still working on the storage requirements for this effort. If we assume 10 TB of storage, this would require $21,606.40/year in requester fees.

Heh. Any sysadmin can tell you that the cost of storage doesn't come just from the storage medium itself; administration costs, supporting hardware, etc., are the bulk of it. But come on, let's be realistic here. There's very little room for good faith in their cost estimates - especially since the last time a single gigabyte cost that much was between 2002 and 2004. That said, some other interesting things are going on here: Their file size estimation is huge - for comparison, Houston's email metadata dump was only 1.2GB. The fact that they mention "meta data" [sic] implies that they did acknowledge that the request was for metadata. And Seattle already uses Amazon S3 to store public records requests' data; at the time, S3 was charging $.023/GB.

Continue anyway?

At this time, the City anticipates that it will be able to provide a first installment of records on or about May 29, 2017. However, please note that this time estimate may change depending on the clarification you provide and as we continue to process your request. If the City does not hear from you within 30 days, the City will consider your request closed.

Oddly, they don't actually close out the request and instead ask whether I want to continue or not. I responded to their amazing email by asking how many records I'd receive on May 29th, but never received an answer back.

Reversal of the Original Cost Estimate

On June 5, they sent a new response admitting that their initial fee estimation was wrong, and asked for $1.25 for two days' (out of three months) of records: At this point I can send you an exel spreadsheet with the data points you are requesting. The cost for the first installment is $1.25 for emails sent or received on January 1 and 2, 2017. Because the spreadsheet does not contain the body of the email and just the metadata that you requested, no review will be necessary, and we'll be able to get this information to you at a faster pace than the 320 years quoted you earlier.

Because they were asking for a single check for $1.25 for just two days' worth of metadata - and wouldn't send anything until that first check came in - my interpretation is that they were taking a page from /r/maliciouscompliance and making this request as painful as possible for the simple sake of making it difficult. So in response, I preemptively sent them fourteen separate checks. The first thirteen checks were all around ~$1.25. That seemed to work, since they never asked for a single payment afterwards. From there, my inbox went mostly silent for two months, and I mostly forgot about the request, though they eventually cashed all of the checks and made me an account for their public records portal.

SNAFU

Fast forward to August 22, when I randomly added that email account back to my phone. Unexpectedly, it turned out they had actually finished the request!
And without a bill for millions of dollars! Sure enough, their public records request portal had about 400 files available to download, which all in all contained metadata for about 32 million emails. Neat! Problem though... they accidentally included the first 256 characters of every one of those 32 million emails. Here are some things I found in the emails: Usernames and passwords. Credit card numbers. Social security numbers and drivers licenses. Ongoing police investigations and arrest reports. Texts of cheating husbands to their lovers. FBI investigations. Zabbix alerts.

In other words... they had just leaked to me a massive dataset filled with intimately private information. In the process, they very likely broke many laws, including the Privacy Act of 1974 and many of WA's own public records laws. Frankly, I'm still at a loss for words. It's hard to say how any of this happened exactly, but odds are that a combination of the request's rewording and the original public records officer going on vacation led to a communication breakdown. I don't want to dwell on the mistake itself, so I'll stop it at that. Side note to Seattle's IT department - clean up your disks. You shouldn't have that many disks at 100%!

Raising the Issue

I responded as passively as possible in the hopes that they'd catch their mistake on their own: The responsive records are not consistent with my request and include much more info than I initially requested. Could you please revisit this request and provide the records responsive to my initial request?

Their response: The information that you requested is located in columns: From address = column J, To address = column K, bcc address = column M, cc address = column L, Time and date = column R of the reports. The records were generated from a system report and I am unable to limit the report to generate only the fields you requested. The City has no duty to create a record that does not exist. As such, we have provided all records responsive to your request and consider your request closed.

Disregarding the fact that they used a very common tactic of denying information on the basis that its disclosure would require the creation of new records... they didn't get the point. I explained what information they leaked, and made it very clear how I was going to escalate this: Please address this matter as if it was a large data breach. For now, I will be raising this matter to the WA Office of Privacy and Data Protection. None of the files provided to me have been shared with anyone else, nor do I have any future intention of sharing.

Their response: Thank you for your email and bringing this inadvertent error to our attention so quickly. We have temporarily suspended access to GovQA while we look into the cause of this issue. We are also working on reprocessing your request and anticipate providing you with corrected copies of the records you requested through GovQA next week. In the meantime, please do not review, share, copy or otherwise use these records for any purpose. We are sorry for any inconvenience.

Phone Call

Not too long after that, after contacting some folks on Seattle's Open Data Slack, I found my way onto a conference phone call with both Seattle's Chief Technology Officer and their Chief Privacy Officer, and we discussed what happened and what should happen with the records. They thanked me for bringing the situation to their attention and all that, but the mood of the call was as if both parties had a knife behind their back.
Somewhere towards the end of the call, I asked them if it was okay to keep the emails. Why not at least ask, right? Funny enough, in the middle of that question, my internet died and interrupted the call - for the first time in the six months I'd lived in that house. Odd. It came back ten minutes later, and I dialed back into the conference line, but the mood of the call had pretty much 180'd. They told me: 1) All files were to be deleted. 2) Seattle would hire Kroll to scan my hard drives to prove deletion. 3) Agreeing to #1 and #2 would give me full legal indemnification. This isn't something I'm even remotely cool with, so we ended the call a couple minutes later and agreed to have our lawyers speak going forward.

Deleting the Files

After that call, I asked my lawyer to reach out to their lawyer and was pretty much told that Seattle was approaching the problem as if they were pursuing Computer Fraud and Abuse Act (CFAA) charges. For information that they sent. Jiminy Cricket. So, I deleted the files. Most of what happened next over a month or so was between their lawyer and mine, so there's not really that much for me to say. Early on I suggested that I write an affidavit explaining what happened, how I deleted the files, and how I validated that the files were deleted. They mostly agreed, but still wanted to throw some silly assurance things my way - including asking me to run a bash script to overwrite any unused disk space with random bits. I eventually ran zerofree and fstrim instead, and they accepted the affidavit. No more legal threats from there.

Seattle's Reaction

About a week after the phone call, a Seattle city employee contacted Seattle's KIRO7 about the incident. In KIRO7's investigation, they learned that Seattle hadn't sent any disclosure of the leak - something required by WA's public records law. Only after their investigation did Seattle actually notify its employees about the email leak. Link to their story (video inside). A week later, another article was published by Seattle's Crosscut which goes into a lot of detail, including some history of Seattle's IT department. This line towards the bottom still makes me laugh a little: The buffer against potential legal and administrative chaos in this scenario is only that Chapman has turned out to be, as Armbruster described him, a "good Samaritan." Efforts to track down Chapman were not successful; Crosscut contacted several Matthew Chapmans who denied being the requester.

On January 19th, Seattle's CTO, Michael Mattmiller, gave his resignation. Whether his resignation is related to the email leak is hard to say, but I just think the timing makes it worth mentioning.

Finally - The Metadata

Starting January 26th, Seattle started sending installments of the email metadata I requested. So far they've sent 27 million emails' worth. As of the writing of this post, there are only two departments that haven't provided their email metadata: the Police Department and Human Services. You can download the raw data here. Some things about the dataset: It's very messy - triple quotes, semicolons, commas, oh my. There are millions of system alerts. For seattle.gov → seattle.gov communication, there are two distinct metadata records. In any case, it's still somewhat workable, so I've been working on a proof of concept for its use in the greater context of public records laws.
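If you want to poke at the data yourself, here's roughly how one might load a day's worth of the metadata into a graph for this kind of modeling. It's a rough illustration rather than the actual proof of concept; the file and column names are placeholders, and the real export's fields and quoting are messier than this.

    import pandas as pd
    import networkx as nx

    # Placeholder file and column names; the real export is messier than this.
    emails = pd.read_csv("seattle_metadata_2017-01-01.csv")

    graph = nx.DiGraph()
    for row in emails.itertuples(index=False):
        sender, recipient = row.from_address, row.to_address
        if graph.has_edge(sender, recipient):
            graph[sender][recipient]["weight"] += 1
        else:
            graph.add_edge(sender, recipient, weight=1)

    # Drop self-loops and trim to a 5-core before handing it to Gephi,
    # similar to the filtering used for the graph below.
    undirected = graph.to_undirected()
    undirected.remove_edges_from(nx.selfloop_edges(undirected))
    core = nx.k_core(undirected, k=5)
    nx.write_gexf(core, "seattle_day.gexf")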
Not ready to talk much about it yet, so here's a Gephi graph of one day's worth of metadata. Its layout is Yifan Hu, filtered with a k-core minimum of 5 and a minimum degree of 5. Please reach out to me if you'd like to help model these networks.

One Last Thing: Legislative Immunity Kerfuffle

This last section might not be related, but the timing is interesting, so I feel it's worth mentioning. On February 23 - between the first installment of email metadata and the second - WA's legislature attempted to pass SB6617, a bill which would remove the requirement to disclose many of their records - including email exchanges - from WA's public records laws. What's particularly interesting about the events around this bill is that it took less than 24 hours from the time it was read for the first time to the time it passed both the House and Senate and was sent to the Governor's office. The Seattle Times wrote a good article about it. Thankfully, after the WA governor's office received over 6,300 phone calls, 100 letters, and over 12,500 emails, the governor ended up vetoing the bill. Neat. It's hard to say if that caused any sort of delay, but after a month and a half of waiting, I asked: How are the installments looking? I saw that there was some recent legislative immunity kerfuffle around emails. Is that related to any delays?

And got this response: Good news. The recent Washington state legislative immunity kerfuffle will not impact your installments. We have fixed the bug that was impacting our progress and are now on our way. In fact, I'll have more records for you this week.

A month later, they started sending the rest.

What's Next?

The work done throughout this post has led to a massive trove of information that ought to be enormously useful in understanding the dynamics of one of the US's biggest cities. A big hope in making this sort of information available to the public is that it will help change the dynamic of understanding what sorts of information are accessible. That said, this is just one city of many that have given me email metadata. As more of it comes through, I'll be able to map out more and more, but the difficulty in requesting those records continues to get in the way. Once I get some of these bigger stories out of the way, I'll start writing fewer stories and write more about public records requesting fundamentals - particularly for digital records. Next post will be about my ongoing suit against the White House OMB for email metadata from January 2017. This past Wednesday was the first court date - where the defendant's counsel never showed up. Hope you enjoyed!

Tags: seattle, foia, kerfuffle, metadata
Intro

Back in 2014, I had the naive goal of finding evidence of collusion between mayoral candidates. The reasoning is longwinded and boring, so I won't go into it. My plan was to find some sort of evidence through a FOIA request or two for the mayor's phone records, find zero evidence of collusion, then move on to a different project like I normally do. What came instead was a painful year-and-a-half struggle to get a single week's worth of phone records from Chicago's Office of the Mayor. Hope you enjoy and learn something along the way!

Requests to Mayor's Office

My first request was simple and assumed that the mayor's office had a modern phone. On Dec 8, 2014, I sent this anonymous FOIA request to Chicago's Office of the Mayor: Please attach all of the mayor's phone records from any city-owned phones (including cellular phones) over the past 4 years.

Ten days later, I received a rejection stating they didn't have any of the mayor's phone records: A FOIA request must be directed to the department that maintains the records you are seeking. The Mayor's Office does not have any documents responsive to your request.

Then, to test whether it was just the mayor whose records their office didn't maintain, I sent another request to the Office of the Mayor - this time specifically for the FOIA officer's phone records - and got the same response. Maybe another department has the records?

VoIP Logs Request

An outstanding question I had (and still have, to some extent) is whether or not server logs are accessible through FOIA. So, to kill two birds with one stone, I sent a request for VoIP server logs to Chicago's Department of Innovation and Technology (DoIT): Please attach in a standard text, compressed format, all VoIP server logs that would contain phone numbers dialed between the dates of 11/24/14 and 12/04/14 for [the mayor's phone].

Ten days later (and five days late), I received a response that my request was being reviewed. Because they were late to respond, IL FOIA says that they can no longer reject my request as unduly burdensome - one of the more interesting statutory pieces of IL FOIA.

The phone... records?

A month goes by, and they send back a two-page PDF with phone numbers whose last four digits are redacted: ...along with a two-and-a-half-page letter explaining why. I really encourage you to read it. tl;dr of their response and its records: They claim that the review/redaction process would be extremely unduly burdensome - even though they were 5 days late! The PDF includes 83 separate phone calls, with 45 unique phone numbers. The last four digits of each phone number are removed. Government-issued cell phones' numbers have been removed completely for privacy reasons. Private phone numbers aren't redacted to the same extent as government cell phones. Government desk phones are redacted.

Their response is particularly strange, because IL FOIA says: "disclosure of information that bears on the public duties of public employees and officials shall not be considered an invasion of personal privacy." With the help of my lawyer, I sent an email to Chicago explaining this... and never received a response. Time to appeal!

Request for Review

In many IL FOIA rejections, the following text is written at the bottom: You have a right of review of this denial by the IL Attorney General's Public Access Counselor, who can be contacted at 500 S. Second St., Springfield, IL 62706 or by telephone at (217) 558-0486.
You may also seek judicial review of a denial under 5 ILCS 140/11 of FOIA.

I went the first route by submitting a Request for Review (RFR). The RFR letter can be boiled down to: They stopped responding. Redaction favors the government personnel's privacy over individuals', despite FOIA statute. Their response to the original request took ten days.

Seven Months Later

Turns out RFRs are very, very slow. So - seven months later, I received an RFR closing letter with a non-binding opinion saying that Chicago should send the records I requested. Their reason mostly boils down to Chicago not giving sufficient reason to call the request unduly burdensome. A long month later, on August 11, 2015, Chicago responds with this, saying that my original request was for VoIP server logs, which Chicago doesn't have: [H]e requested "VoIP server logs," which the Department has established it does not possess. As a result, the City respectfully disagrees with your direction to produce records showing telephone numbers, as there is not an outstanding FOIA request for responsive records in the possession of the Department.

Sure enough, his phone is pretty ancient: Image Source

Do Over

And so after nine months of what felt like wasted effort, I submitted another request that same day: Please attach [...] phone numbers dialed between the dates of 11/24/14 and 12/04/14 for [the Office of the Mayor]

Two weeks later - this time with an on-time extension letter - I was sent another file that looks like this: The exact same file. They even sent the same rejection reasons!

Lawsuit

On 12/2/2015, Loevy & Loevy filed suit against Chicago's DoIT. The summary of the complaint is that we disagree with their claim that reviewing and redacting the phone records would be extremely unduly burdensome. My part in this was waiting while my lawyer did all the work, so there's really not much for me to write about here.

Lawsuit conclusion

Six months later, on May 11, 2016, the city settled and gave me four pages of phone logs - most of which were still redacted. Some battles, eh? Interesting bits from the court document: ...DoIT and its counsel became aware that, in its August 24, 2015 response, DoIT had inadvertently misidentified the universe of responsive numbers. DoIT identified approximately 130 additional phone numbers dialed from the phones dialed within Suite 507 of City Hall, bringing the total to 171 ...FOIA only compels the production of listed numbers belonging to businesses, governmental agencies and other entities, and only those numbers which are not work-issued cell phones. ...DoIT asserted that compliance with plaintiff request was overly burdensome pursuant to Section 3(g) of FOIA. On those grounds —rather than provide no numbers at all — DoIT redacted the last four digits of all phone numbers provided ...in other words, they "googled" each number to determine whether that number was publically listed, and, if so, to whom it belonged. This resulted in the identification of 57 out of the 137 responsive numbers...

Lawsuit Records

All in all, the phone records contained: 171 unique phone log entries: 57 unredacted and 114 redacted. 32 unique unredacted phone numbers. 44 unique redacted phone numbers. From there, there really wasn't much to work with. Most of the phone calls were day-to-day calls to places like flower shops, doctors and restaurants.
Still, some numbers are interesting: A four-hour hotel: Prestige Club: Aura Investigative services: Statewide Investigative Services and Kennealy & O'Callaghanh Michael Madigan Data: Lawsuit Records

Going deeper

With all of that done - a year and a half in total for one request - I wasn't feeling satisfied and dug deeper. This time, I started approaching it methodically to build a toolchain of sorts. So, to determine whether the same length of time could be requested without another lawsuit: Please provide me with the to/from telephone numbers, duration, time and date of all calls dialed from 121 N La Salle St #507, Chicago, IL 60602 for the below dates. April 6-9, 2015 November 23-25, 2015

And sure enough, two weeks later, I received two PDFs with phone records - this time with times, dates, from numbers and call lengths! Much faster now! Still, it's lame that they're still redacting a lot, and there wasn't anything interesting in these records. Data: Long Distance, Local

Full year of records

How about a full year for a small set of previously released phone numbers? Please provide to me, for the year of 2014, the datetime and dialed-from number for the below numbers from the [Office of the Mayor] (312) 942-1222 [Statewide Investigative Services] (505) 747-4912 [Azura Investigations] (708) 272-6000 [Aura - Prestige Club] (773) 783-4855 [Kennealy & O'Callaghanh] (312) 606-9999 [Siam Rice] (312) 553-5698 [Some guy named Norman Kwong]

Again, success! Siam Rice: 55 calls! Statewide Investigative Services: 8 calls Keannealy & O'Callaghanh: 10 calls Some guy named Norman Kwong: 3 calls (Interestingly enough, they didn't give me the Prestige Club phone numbers. Heh!) Data: Full year records

Full year of records - City Hall

And finally, another request for previously released numbers - across all of City Hall in 2014 and 2015: The phone numbers, names and call times to and from the phone numbers listed below during 2014 and 2015 [within city hall] (312) 942-1222 [Statewide Investigative Services] (773) 783-4855 [Kennealy & O'Callaghanh] (312) 346-4321 [Madigan & Getzendanner] (773) 581-8000 [Michael Madigan] (708) 272-6000 [Aura - Prestige Club] Data: City hall full year

This means that a few methods of retrieving phone records are possible: A week's worth of (mostly redacted) records from a high-profile office. Phone records through an office as big as City Hall. Requesting unredacted phone numbers to use in future requests.

Phone directory woes

Of the 27 distinct Chicago phone numbers found within the last request's records, only five could be resolved to an entry in Chicago's phone directory: DEAL, AARON J KLINZMAN, GRANT T EMANUEL, RAHM NELSON, ASHLI RENEE MAGANA, JASMINE M

This is a problem that I haven't solved yet, but it should be easy enough by requesting a full phone directory from Chicago's DoIT. Anyone up for that challenge? ;)

Emails?

This probably deserves its own blog post, but I wanted to tease it a bit, because it leads into other posts. I sent this request with the presumption that the redaction of emails would take a very long time: From all emails sent from [the Office of the Mayor] between 11/24/14 and 12/04/14, please provide me all domain names for all email addresses in the to/cc/bcc. From each email, include the sender's address and sent times.

Two months later, I received a 1,751-page document with full email addresses for to, from, cc, and bcc, including the times of 18,860 separate emails to and from the mayor's office.
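As a rough starting point, pulling addresses out of a document like that might look something like the sketch below, using pdfminer.six. The file name is made up, the regex is simplistic, and the real PDF's layout is far messier than this - which is a big part of why parsing it took so long.

    import re
    from collections import Counter
    from pdfminer.high_level import extract_text

    # Extract raw text from the released PDF, then pull out anything that
    # looks like an email address and tally the domains.
    text = extract_text("mayors_office_email_metadata.pdf")

    addresses = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
    domains = [addr.split("@", 1)[1].lower() for addr in addresses]

    print(Counter(domains).most_common(10))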
Neat - it only took about a year and a half to figure out how to parse the damn thing, though.... Interestingly, the mayor's email address isn't in there that often... Data: Email Metadata

What's next?

This whole process was a complete and total pain. Knowing the goings-on of our government - especially at its highest levels - is critical for ensuring that our government is open and honest. It really shouldn't have been this difficult, but it was. The difficulties led me down an interesting path of making many similar requests - and boy are there stories. Next post: The time Seattle accidentally sent me 30m emails for ~$30.

Code

Scraping Chicago's phone directory
Parsing 1,700 page email pdf

Tags: foia, phone, email, data, chicago
More in programming
As I slowly but surely work towards the next release of my setcmd project for the Amiga (see the 68k branch for the gory details and my total noob-like C flailing around), I've made heavy use of documentation in the AmigaGuide format. Despite its age, it's a great Amiga-native format and there's a wealth of great information out there for things like the C API, as well as language guides and tutorials for tools like the Installer utility - and the AmigaGuide markup syntax itself. The only snag is, I had to have access to an Amiga (real or emulated), or install one of the various viewer programs on my laptops. Because, like many, I spend a lot of time in a web browser and occasionally want to check something on my mobile phone, this is less than convenient. Fortunately, there's a great AmigaGuideJS online viewer which renders AmigaGuide format documents using Javascript. I've started building up a collection of useful developer guides and other files in my own reference library so that I can access this documentation whenever I'm not at my Amiga or am coding in my "modern" dev environment. It's really just for my own personal use, but I'll be adding to it whenever I come across a useful piece of documentation so I hope it's of some use to others as well!

And on a related note, I now have a "unified" code-base so that SetCmd now builds and runs on 68k-based OS 3.x systems as well as OS 4.x PPC systems like my X5000. I need to: Tidy up my code and fix all the "TODO" stuff. Update the Installer to run on OS 3.x systems. Update the documentation. Build a new package and upload to Aminet/OS4Depot. Hopefully I'll get that done in the next month or so. With the pressures of work and family life (and my other hobbies), progress has been a lot slower these last few years but I'm still really enjoying working on Amiga code and it's great to have a fun personal project that's there for me whenever I want to hack away at something for the sheer hell of it. I've learned a lot along the way and the AmigaOS is still an absolute joy to develop for. I even brought my X5000 to the most recent Kickstart Amiga User Group BBQ/meetup and had a fun day working on the code with fellow Amigans and enjoying some classic gaming & demos - there was also a MorphOS machine there, which I think will be my next target as the codebase is slowly becoming more portable. Just got to find some room in the "retro cave" now… This stuff is addictive :)
A little while back I heard about the White House launching their version of a Drudge Report style website called White House Wire. According to Axios, a White House official said the site's purpose was to serve as "a place for supporters of the president's agenda to get the real news all in one place". So a link blog, if you will. As a self-professed connoisseur of websites and link blogs, this got me thinking: "I wonder what kind of links they're considering as 'real news' and what they're linking to?" So I decided to do a quick analysis using Quadratic, a programmable spreadsheet where you can write code and return values to a 2d interface of rows and columns. I wrote some JavaScript to: fetch the HTML page at whitehouse.gov/wire, parse it with cheerio, select all the external links on the page, and return a list of links and their headline text. In a few minutes I had a quick analysis of what kind of links were on the page.

This immediately sparked my curiosity to know more about the meta information around the links, like: If you grouped all the links together, which sites get linked to the most? What kind of interesting data could you pull from the headlines they're writing, like the most frequently used words? What if you did this analysis, but with snapshots of the website over time (rather than just the current moment)? So I got to building. Quadratic today doesn't yet have the ability for your spreadsheet to run in the background on a schedule and append data. So I had to look elsewhere for a little extra functionality. My mind went to val.town which lets you write little scripts that can 1) run on a schedule (cron), 2) store information (blobs), and 3) retrieve stored information via their API. After a quick read of their docs, I figured out how to write a little script that'll run once a day, scrape the site, and save the resulting HTML page in their key/value storage. From there, I was back to Quadratic writing code to talk to val.town's API and retrieve my HTML, parse it, and turn it into good, structured data. There were some things I had to do, like: Fine-tune how I select all the editorial links on the page from the source HTML (I didn't want, for example, to include external links to the White House's social pages which appear on every page). This required a little finessing, but I eventually got a collection of links that corresponded to what I was seeing on the page. Parse the links and pull out the top-level domains so I could group links by domain occurrence. Create charts and graphs to visualize the structured data I had created. Selfish plug: Quadratic made this all super easy, as I could program in JavaScript and use third-party tools like tldts to do the analysis, all while visualizing my output on a 2d grid in real-time which made for a super fast feedback loop! Once I got all that done, I just had to sit back and wait for the HTML snapshots to begin accumulating! It's been about a month and a half since I started this and I have about fifty days' worth of data. The results?
Here's the top 10 domains that the White House Wire links to (by occurrence), from May 8 to June 24, 2025: youtube.com (133) foxnews.com (72) thepostmillennial.com (67) foxbusiness.com (66) breitbart.com (64) x.com (63) reuters.com (51) truthsocial.com (48) nypost.com (47) dailywire.com (36). From the links, here's a word cloud of the most commonly recurring words in the link headlines: "trump" (343) "president" (145) "us" (134) "big" (131) "bill" (127) "beautiful" (113) "trumps" (92) "one" (72) "million" (57) "house" (56). The data and these graphs are all in my spreadsheet, so I can open it up whenever I want to see the latest data and re-run my script to pull the latest from val.town. In response to the new data that comes in, the spreadsheet automatically parses it, turns it into links, and updates the graphs. Cool! If you want to check out the spreadsheet — sorry! My API key for val.town is in it ("secrets management" is on the roadmap). But I created a duplicate where I inlined the data from the API (rather than the code which dynamically pulls it) which you can check out here at your convenience. Email · Mastodon · Bluesky
One of the first types we learn about is the boolean. It's pretty natural to use, because boolean logic underpins much of modern computing. And yet, it's one of the types we should probably be using a lot less of. In almost every single instance when you use a boolean, it should be something else. The trick is figuring out what "something else" is. Doing this is worth the effort. It tells you a lot about your system, and it will improve your design (even if you end up using a boolean). There are a few possible types that come up often, hiding as booleans. Let's take a look at each of these, as well as the case where using a boolean does make sense. This isn't exhaustive[1]—there are surely other types that can make sense, too.

Datetimes

A lot of boolean data is representing a temporal event having happened. For example, websites often have you confirm your email. This may be stored as a boolean column, is_confirmed, in the database. It makes a lot of sense. But, you're throwing away data: when the confirmation happened. You can instead store when the user confirmed their email in a nullable column. You can still get the same information by checking whether the column is null. But you also get richer data for other purposes. Maybe you find out down the road that there was a bug in your confirmation process. You can use these timestamps to check which users would be affected by that, based on when their confirmation was stored. This is the one I've seen discussed the most of all these. We run into it with almost every database we design, after all. You can detect it by asking if an action has to occur for the boolean to change values, and if values can only change one time. If you have both of these, then it really looks like it is a datetime being transformed into a boolean. Store the datetime!

Enums

Much of the remaining boolean data indicates either what type something is, or its status. Is a user an admin or not? Check the is_admin column! Did that job fail? Check the failed column! Is the user allowed to take this action? Return a boolean for that, yes or no! These usually make more sense as an enum. Consider the admin case: this is really a user role, and you should have an enum for it. If it's a boolean, you're going to eventually need more columns, and you'll keep adding on other statuses. Oh, we had users and admins, but now we also need guest users and we need super-admins. With an enum, you can add those easily.

enum UserRole {
    User,
    Admin,
    Guest,
    SuperAdmin,
}

And then you can usually use your tooling to make sure that all the new cases are covered in your code. With a boolean, you have to add more booleans, and then you have to make sure you find all the places where the old booleans were used and make sure they handle these new cases, too. Enums help you avoid these bugs. Job status is one that's pretty clearly an enum as well. If you use booleans, you'll have is_failed, is_started, is_queued, and on and on. Or you could just have one single field, status, which is an enum with the various statuses. (Note, though, that you probably do want timestamp fields for each of these events—but you're still best off having the status stored explicitly as well.) This begins to resemble a state machine once you store the status, and it means that you can make much cleaner code and analyze things along state transition lines. And it's not just for storing in a database, either. If you're checking a user's permissions, you often return a boolean for that.
fn check_permissions(user: User) -> bool {
    false // no one is allowed to do anything i guess
}

In this case, true means the user can do it and false means they can't. Usually. I think. But you can really start to have doubts here, and with any boolean, because the application-logic meaning of the value cannot be inferred from the type. Instead, this can be represented as an enum, even when there are just two choices.

enum PermissionCheck {
    Allowed,
    NotPermitted { reason: String },
}

As a bonus, though, if you use an enum? You can end up with richer information, like returning a reason for a permission check failing. And you are safe for future expansions of the enum, just like with roles. You can detect when something should be an enum by a proliferation of booleans which are mutually exclusive or depend on one another. You'll see multiple columns which are all changed at the same time. Or you'll see a boolean which is returned and used for a long time. It's important to use enums here to keep your program maintainable and understandable.

Conditionals

But when should we use a boolean? I've mainly run into one case where it makes sense: when you're (temporarily) storing the result of a conditional expression for evaluation. This is in some ways an optimization, either for the computer (reuse a variable[2]) or for the programmer (make it more comprehensible by giving a name to a big conditional) by storing an intermediate value. Here's a contrived example of using a boolean as an intermediate value.

fn calculate_user_data(user: User, records: RecordStore) {
    // this would be some nice long conditional,
    // but I don't have one. So variables it is!
    let user_can_do_this: bool = (a && b) && (c || !d);

    if user_can_do_this && records.ready() {
        // do the thing
    } else if user_can_do_this && records.in_progress() {
        // do another thing
    } else {
        // and something else!
    }
}

But even here in this contrived example, some enums would make more sense. I'd keep the boolean, probably, simply to give a name to what we're calculating. But the rest of it should be a match on an enum!

* * *

Sure, not every boolean should go away. There's probably no single rule in software design that is always true. But, we should be paying a lot more attention to booleans. They're sneaky. They feel like they make sense for our data, but they really make sense for our logic. The data is usually something different underneath. By storing a boolean as our data, we're coupling that data tightly to our application logic. Instead, we should remain critical and ask what data the boolean depends on, and should we maybe store that instead? It comes easier with practice. Really, all good design does. A little thinking up front saves you a lot of time in the long run.

[1] I know that using an em-dash is treated as a sign of using LLMs. LLMs are never used for my writing. I just really like em-dashes and have a dedicated key for them on one of my keyboard layers. ↩
[2] This one is probably best left to the compiler. ↩
SumatraPDF is a fast, small, open-source PDF reader for Windows, written in C++. This article describes how I implemented the StrVec class for efficiently storing multiple strings.

Much ado about the strings

Strings are among the most used types in most programs. Arrays of strings are also used often. I count ~80 uses of StrVec in SumatraPDF code. This article describes how I implemented an optimized array of strings in SumatraPDF C++ code.

No STL for you

Why not use std::vector<std::string>? In SumatraPDF I don't use STL. I don't use std::string, I don't use std::vector. For me it's a symbol of my individuality, and my belief in personal freedom.

As described here, the minimum size of std::string on 64-bit machines is 32 bytes for msvc / gcc and 24 bytes for clang; short strings (15 chars for msvc / gcc, 22 chars for clang) are stored inline in that space. For longer strings we have more overhead:

32/24 bytes for the header
memory allocator overhead: allocator metadata and padding due to rounding allocations to at least 16 bytes

There's also std::vector overhead: for fast appends (push_back()), std::vector implementations over-allocate space.

Longer strings are allocated at random addresses, so they can be spread out in memory. That is bad for cache locality and often causes more slowness than executing lots of instructions.

Design and implementation of StrVec

StrVec (vector of strings) solves all of the above:

per-string overhead of only 8 bytes
strings are laid out next to each other in memory

StrVec

High level design of StrVec:

backing memory is allocated in singly-linked pages
similar to std::vector, we start with a small page and increase the size of each following page. This strikes a balance between the speed of accessing a string at a random index and wasted space
unlike std::vector, we don't reallocate memory (most of the time). That saves a memory copy when re-allocating backing space

Here's all there is to StrVec:

struct StrVec {
    StrVecPage* first = nullptr;
    int nextPageSize = 256;
    int size = 0;
};

size is a cached number of strings. It could be calculated by summing nStrings over all StrVecPages. nextPageSize is the size of the next StrVecPage. Most array implementations increase the size of the next allocation by 1.4x - 2x. I went with the following progression: 256 bytes, 1k, 4k, 16k, 32k, and I cap it at 64k. I don't have data behind those numbers; they feel right. A bigger page wastes more space. A smaller page makes random access slower, because to find the N-th string we need to traverse the linked list of StrVecPages. nextPageSize is exposed to allow the caller to optimize use. E.g. if it expects lots of strings, it could set nextPageSize to a large number.

StrVecPage

Most of the implementation is in StrVecPage. The big idea here is:

we allocate a block of memory
strings are allocated from the end of the memory block
at the beginning of the memory block we build an index of strings. For each string we have a u32 size and a u32 offset of the string within the memory block, counting from the beginning of the block

The layout of the memory block is:

StrVecPage struct
{ size u32; offset u32 } []
… not yet used space
strings

This is StrVecPage:

struct StrVecPage {
    struct StrVecPage* next;
    int pageSize;
    int nStrings;
    char* currEnd;
};

next is for the linked list of pages. Since pages can have various sizes, we need to record pageSize. nStrings is the number of strings in the page and currEnd points to the end of free space within the page.
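To make that layout concrete, here's a minimal sketch of how the i-th string in a page could be located (an illustration of the layout described above, not the actual SumatraPDF code; IndexEntry, PageIndex, and StringAt are made-up names):

#include <cstdint>

// Illustrative only: header, then an array of { size, offset } index entries,
// then free space, then the string bytes packed toward the end of the block.
struct IndexEntry {
    uint32_t size;   // string length, not counting the 0-terminator
    uint32_t offset; // offset from the beginning of the block
};

struct StrVecPage {
    StrVecPage* next;
    int pageSize;
    int nStrings;
    char* currEnd;
};

// The index starts right after the page header.
static IndexEntry* PageIndex(StrVecPage* p) {
    return (IndexEntry*)((char*)p + sizeof(StrVecPage));
}

// Return a pointer to the i-th string stored in this page.
static char* StringAt(StrVecPage* p, int i) {
    IndexEntry e = PageIndex(p)[i];
    return (char*)p + e.offset;
}

Random access across the whole StrVec then walks the page list, subtracting each page's nStrings from the requested index until it reaches the right page.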
Implementing operations

Appending a string

Appending a string at the end is the most common operation. To append a string:

we calculate how much memory inside a page it'll need: str::Len(string) + 1 + sizeof(u32) + sizeof(u32). The +1 is for 0-termination, for compatibility with C APIs that take char*, and the 2 x u32 are for size and offset
if we have enough space in the last page, we add size and offset at the end of the index and append the string from the end, i.e. at currEnd - (str::Len(string) + 1)
if there is not enough space in the last page, we allocate a new page

We can calculate how much space we have left with:

int indexEntrySize = sizeof(u32) + sizeof(u32); // size + offset
char* indexEnd = (char*)pageStart + sizeof(StrVecPage) + nStrings * indexEntrySize;
int nBytesFree = (int)(currEnd - indexEnd);

Removing a string

Removing a string is easy because it doesn't require moving memory inside StrVecPage. We do nStrings-- and move the index values of the strings after the removed string. I don't bother freeing the string memory within a page. It's possible, but complicated enough that I decided to skip it. You can compact StrVec to remove all overhead.

If you do not care about preserving the order of strings after removal, I have RemoveAtFast(), which uses a trick: instead of copying the memory of all index values after the removed string, I copy a single index from the end into the slot of the string being removed.

Replacing a string or inserting in the middle

Replacing a string or inserting a string in the middle is more complicated because there might not be enough space in the page for the string. When there is enough space, it's as simple as append. When there is not enough space, I re-use the compacting capability: I compact all existing pages into a single page with extra space for the string, plus some extra space as an optimization for multiple inserts.

Iteration

Random access requires traversing the linked list. I think it's still fast because typically there aren't many pages and we only need to look at a single nStrings value per page. After compaction to a single page, random access is as fast as it could ever be.

The C++ iterator is optimized for sequential access:

struct iterator {
    const StrVec* v;
    int idx;
    // perf: cache page, idxInPage from prev iteration
    int idxInPage;
    StrVecPage* page;
};

We cache the current state of iteration as page and idxInPage. To advance to the next string we advance idxInPage. If it exceeds nStrings, we advance to page->next.

Optimized search

Finding a string is as optimized as it could be without a hash table. Typically, to compare char* strings you have to call str::Eq(s, s2) for every string you compare against. That is a function call, and it has to touch s2's memory, which is bad for performance because it blows the cache. In StrVec I calculate the length of the string to find once and then traverse the size / offset index. Only when the size matches do I have to compare the actual string bytes. Most of the time we just look at offset / size in the L1 cache, which is very fast.
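Here's a minimal sketch of that idea (illustrative only, not the actual SumatraPDF code; FindInPage, IndexEntry, and the parameter names are assumptions):

#include <cstdint>
#include <cstring>

struct IndexEntry {
    uint32_t size;   // cached string length
    uint32_t offset; // offset of the string within the block
};

// Compare cached lengths first; only touch string bytes when the lengths
// match. The cheap rejections stay in the index, which is hot in cache.
static int FindInPage(const char* block, const IndexEntry* index, int nStrings,
                      const char* needle) {
    uint32_t needleLen = (uint32_t)strlen(needle);
    for (int i = 0; i < nStrings; i++) {
        if (index[i].size != needleLen) {
            continue;
        }
        if (memcmp(block + index[i].offset, needle, needleLen) == 0) {
            return i;
        }
    }
    return -1; // not found in this page
}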
Compacting

If you know that you'll not be adding more strings to StrVec, you can compact all pages into a single page with no overhead of empty space. It also speeds up random access because we don't have multiple pages to traverse to find the item at a given index.

Representing a nullptr char*

Even though I have a string class, I mostly use char* in SumatraPDF code. In that world an empty string and nullptr are 2 different things. To allow storing nullptr strings in StrVec (and not turning them into empty strings on the way out) I use a trick: a special u32 value kNullOffset represents nullptr.

StrVec is a string pool allocator

In C++ you have to track the lifetime of each object:

you allocate with malloc() or new
when you no longer need the object, you call free() or delete

However, the lifetimes of allocations are often tied together. For example, in SumatraPDF an opened document is represented by a class. Many allocations done to construct that object last exactly as long as the object.

The idea of a pool allocator is that instead of tracking the lifetime of each allocation, you have a single allocator. You allocate objects with the same lifetime from that allocator and you free them all with a single call. StrVec is a string pool allocator: all strings stored in StrVec have the same lifetime.

Testing

In general I don't advocate writing a lot of tests. However, low-level, tricky functionality like StrVec deserves decent test coverage to ensure basic functionality works and to exercise code for corner cases. I have 360 lines of tests for ~700 lines of implementation.

Potential tweaks and optimizations

When designing and implementing data structures, tradeoffs are aplenty.

Interleaving index and strings

I'm not sure if it would be faster, but instead of storing size and offset at the beginning of the page and strings at the end, we could store size / string pairs sequentially from the beginning. It would remove the need for the u32 offset but would make random access slower.

Varint encoding of size and offset

Most strings are short, under 127 chars. Most offsets are under 16k. If we stored size and offset as variable-length integers, we would probably bring the average per-string overhead down from 8 bytes to ~4 bytes.
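As an illustration, here's a minimal sketch of a standard LEB128-style unsigned varint encoding that could be used for this (hypothetical; StrVec does not currently do this, and EncodeVarint / DecodeVarint are made-up names):

#include <cstdint>

// 7 bits of payload per byte, high bit set on every byte except the last.
// Values under 128 take 1 byte and values under 16k take 2 bytes, so most
// sizes and offsets would shrink from 4 bytes each to 1-2 bytes each.
static int EncodeVarint(uint32_t v, uint8_t* out) {
    int n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n; // number of bytes written
}

static uint32_t DecodeVarint(const uint8_t* in, int* nBytesRead) {
    uint32_t v = 0;
    int shift = 0;
    int n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint32_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *nBytesRead = n;
    return v;
}

The tradeoff is that index entries would no longer be fixed-size, so random access into the index would need extra bookkeeping.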
Implicit size

When strings are stored sequentially, size is implicit as the difference between the offset of a string and the offset of the next string. Not storing size would make insert and set operations more complicated and costly: we would have to compact and arrange strings in order every time.

Storing index separately

We could store the size / offset index in a separate vector and use pages only to allocate string data. This would simplify insert and set operations. With the current design, if we run out of space inside a page, we have to re-arrange memory. When the offset is stored outside of the page, it can refer to any page, so insert and set could be as simple as append.

The evolution of StrVec

The design described here is the second implementation of StrVec. The one before was simply a combination of str::Str (my std::string) for allocating all strings and Vec<u32> (my std::vector) for storing the offset index. It had some flaws: appending a string could re-allocate memory within str::Str. The caller couldn't store the returned char* pointer because it could be invalidated. As a result the API was awkward and potentially confusing: I was returning the offset of the string, so the string was str::Str.Data() + offset.

The new StrVec doesn't re-allocate on Append, only (potentially) on InsertAt and SetAt. The most common case is append-only, which allows the caller to store the returned char* pointers.

Before implementing StrVec I used Vec<char*>. Vec is my version of std::vector and Vec<char*> would just store pointers to individually allocated strings.

Cost vs. benefit

I'm a pragmatist: I want to achieve the most with the least amount of code, the least amount of time and effort. While it might seem that I'm re-implementing things willy-nilly, I'm actually very mindful of the cost of writing code. Writing software is a balance between effort and resulting quality.

One of the biggest reasons SumatraPDF is so popular is that it's fast and small. That's an important aspect of software quality. When you double-click a PDF file in Explorer, SumatraPDF starts instantly. You can't say that about many similar programs, or about other software in general. Keeping SumatraPDF small and fast is an ongoing focus and it does take effort.

StrVec.cpp is only 705 lines of code. It took me several days to complete. Maybe 2 days to write the code and then some time here and there to fix the bugs.

That being said, I didn't start with this StrVec. For many years I used the obvious Vec<char*>. Then I implemented a somewhat optimized StrVec. And a few years after that I implemented this ultra-optimized version.

References

SumatraPDF is a small, fast, multi-format (PDF/eBook/Comic Book and more), open-source reader for Windows.

The implementation described here: StrVec.cpp, StrVec.h, StrVec_ut.cpp

By the time you read this, the implementation could have been improved.
Consent morality is the idea that there are no higher values or virtues than allowing consenting adults to do whatever they please. As long as they're not hurting anyone, it's all good, and whoever might have a problem with that is by definition a bigot.

This was the overriding morality I picked up as a child of the 90s. From TV, movies, music, and popular culture. Fly your freak flag! Whatever feels right is right! It doesn't seem like much has changed since then. What a moral dead end.

I first heard the term consent morality as part of Louise Perry's critique of the sexual revolution. That in the context of hook-up culture, situationships, and falling birthrates, we have to wrestle with the fact that the sexual revolution — and its insistence that, say, a sky-high body count mustn't be taboo — has led society to a screwy dating market in the internet age that few people are actually happy with.

But the application of consent morality that I actually find even more troubling is towards parenthood. As is widely acknowledged now, we're in a bit of a birthrate crisis all over the world, and I think consent morality can help explain part of it.

I was reminded of this when I posted a cute video of a young girl so over-the-moon excited for her dad getting off work, to argue that you'd be crazy to trade that for some nebulous concept of "personal freedom". Predictably, consent morality immediately appeared in the comments: Some people just don't want children and that's TOTALLY OKAY and you're actually bad for suggesting they should!

No. It's the role of a well-functioning culture to guide people towards The Good Life. Not force, but guide. Nobody wants to be convinced by the morality police at the pointy end of a bayonet, but giving up on the whole idea of objective higher values and virtues is a nihilistic and cowardly alternative.

Humans are deeply mimetic creatures. It's imperative that we celebrate what's good, true, and beautiful, such that these ideals become collective markers for morality. Such that they guide behavior.

I don't think we've done a good job of that with parenthood in the last thirty-plus years. In fact, I'd argue we've done just about everything to undermine the cultural appeal of the simple yet divine satisfaction of child rearing (and by extension maligned the square family unit of mom, dad, and a few kids). Partly out of a coordinated campaign against the family unit as some sort of trad (possibly fascist!) identity marker in a long-waged culture war, but perhaps just as much out of the banal denigration of how boring and limiting it must be to carry such simple burdens as being a father or a mother in modern society.

It's no wonder that if you incessantly focus on how expensive it is, how little sleep you get, how terrifying the responsibility is, and how much stress is involved with parenthood, it doesn't seem all that appealing!

This is where Jordan Peterson does his best work. In advocating for the deeper meaning of embracing burden and responsibility. In diagnosing that much of our modern malaise does not come from carrying too much, but from carrying too little. That a myopic focus on personal freedom — the nights out, the "me time", the money saved — is a spiritual mirage: You think you want the paradise of nothing ever being asked of you, but it turns out to be the hell of nobody ever needing you.

Whatever the cause, I think part of the cure is for our culture to reembrace the virtue and the value of parenthood without reservation.
To stop centering the margins and their pathologies. To start centering the overwhelming middle, where most people make for good parents, and will come to see that role as the most meaningful part they've played in their time on this planet.

But this requires giving up on consent morality as the only way to find our path to The Good Life. It involves taking a moral stance that some ways of living are better than other ways of living for the broad many. That parenthood is good, and that we need more children, not just for the literal survival of civilization, but also for the collective motivation to guard against the bad, the false, and the ugly.

There's more to life than what you feel like doing in the moment. The worst thing in the world is not to have others ask more of you. Giving up on the total freedom of the unmoored life is a small price to pay for finding the deeper meaning in a tethered relationship with continuing a bloodline that's been drawn for hundreds of thousands of years before it came to you.

You're never going to be "ready" before you take the leap. If you keep waiting, you'll wait until the window has closed, and all you'll see is regret. Summon a bit of bravery, don't overthink it, and do your part for the future of the world. It's 2.1 or bust, baby!