The “Import and Update” Pattern

November 12, 2014

Almost all CMSs support content import, to some extent. There’s always an API, and often a web service, that lets you fire content into a system from the outside.

But a model we see over and over, and one that really needs to be explicitly acknowledged, is “import and update.” This means: create new content if it doesn’t exist, but update it in place if it was previously created. It supports situations where we’re syncing information stored inside the CMS with information stored outside it.

For example, let’s say our hospital maintains its physician profiles in a separate database (for whatever reason). However, we need our physicians to have managed content objects inside the CMS, for a variety of reasons (for a list of why this is handy, see my post on proxy objects in CMS).

We can easily write a job to import our physician profiles, but what happens when they change in the source database? We don’t want to import again; we just want to update the page inside the CMS. Sure, we could delete it and recreate it, but that becomes problematic when it might change the URL, increment a set of ID numbers, or even delete information in the CMS which references that specific content object (analytics, for example).

EPiServer has a “Content Channel” architecture that handles this. You fire a dictionary of key-value pairs (representing content properties and their values) at a web service, optionally including the GUID of an existing content object. No GUID means EPiServer creates a new object, while data arriving with a GUID finds the corresponding page and updates it with the incoming information. It essentially keeps the content object shell but overwrites all the information in it.

With any system like this, you need to maintain a mapping between the ID outside the CMS and the ID inside the CMS. You need to know that Database Record #654 corresponds to Content ID #492, so that when you run across ID #654 while iterating your database rows, you know to reference ID #492 when talking to the CMS. You also need to be able to get the newly-created ID back out of the CMS when content is created, so you can record a mapping for it – if my CMS creates Content ID #732, I need to know this so I can reference it later.
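A minimal sketch of the pattern in Python. The `cms` client and its `create`/`update` methods are hypothetical stand-ins, not any real CMS API (EPiServer’s actual Content Channel interface works differently):

```python
# Hypothetical sketch of "import and update." The CMS client interface
# (create/update) and the field names are illustrative assumptions.

def sync_physicians(source_rows, cms, id_map):
    """source_rows: records from the external database.
    cms: a client exposing create(data) -> new_cms_id and update(cms_id, data).
    id_map: dict mapping external IDs to CMS content IDs (must be persisted
    between runs -- this is the Database Record #654 -> Content ID #492 map)."""
    for row in source_rows:
        external_id = row["id"]
        data = {"name": row["name"], "specialty": row["specialty"]}
        if external_id in id_map:
            # Previously imported: update the existing content object in
            # place, preserving its URL, ID, and anything referencing it.
            cms.update(id_map[external_id], data)
        else:
            # Never seen before: create a new object and record the new
            # CMS ID, so future runs update rather than re-import.
            id_map[external_id] = cms.create(data)
    return id_map
```

The important design point is that `id_map` survives between runs; lose it, and every sync becomes a fresh import with all the URL and ID problems described above.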

Some CMSs offer “content provider” models, which are real-time methods to “mount” other repositories. So, instead of importing and updating this data, the CMS reaches out to our external database in real time when required, gets objects back, and mocks them up as first-order content objects.

This is certainly elegant and sophisticated, but it presents problems: performance, uptime of the source system, unnecessary computational overhead if the content doesn’t change much, dependence on network topology and unbroken connectivity, and the inability to extend the content with new data inside the CMS (for instance, while 90% of the information about our physicians comes from the external database, perhaps we have a couple of properties that live inside the CMS only).
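For contrast, here is a rough sketch of the shape a content provider takes. Everything here is illustrative – the class, method names, and fields are assumptions, not any particular CMS’s provider API:

```python
# Rough sketch of the "content provider" model: the external database is
# consulted in real time on every request. All names here are hypothetical.

class ExternalPhysicianProvider:
    """Mounts an external database as if it held first-order content objects."""

    def __init__(self, db, local_overrides):
        self.db = db  # external source of truth, queried live
        # CMS-only properties, keyed by physician ID -- one way to work
        # around the "can't extend the content inside the CMS" limitation.
        self.local_overrides = local_overrides

    def get(self, physician_id):
        # A live fetch on every request -- hence the performance, uptime,
        # and connectivity concerns with this model.
        record = self.db.fetch(physician_id)
        if record is None:
            return None
        content = {"name": record["name"], "specialty": record["specialty"]}
        # Layer any CMS-only properties on top of the external data.
        content.update(self.local_overrides.get(physician_id, {}))
        return content
```

In practice, implementations usually add a caching layer in front of `fetch` to blunt the performance and uptime problems, at the cost of some staleness.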

I hope to see this pattern more often. EPiServer has it, eZ Publish has it, and I’m sure many others do too. Additionally, it’s not hard to build: if you can put together a web service, you should be able to pull it off.

It’s a handy thing to have.


Metadata Depends on Perspective

November 12, 2014

I’m reading The Discipline of Organizing. Early in the book, the author talks about “metadata,” which is a topic I’ve complained about before (go read those; I’ll wait). When it comes to web content management, I think it’s hard to differentiate between the “first order data” and the “metadata.” Which is which?

The author calls it even further into question by introducing the perspective of the observer.

[…] what serves as metadata for one person or process can function as a primary resource or data for another one. Rather than being an inherent distinction, the difference between primary and associated resources is often just a decision about which resource we are focusing on in some situation. An animal specimen in a natural history museum might be a primary resource for museum visitors and scientists interested in anatomy, but information about where the specimen was collected is the primary resource for scientists interested in ecology or migration.


Things that Web Crawlers Hate

November 12, 2014

I wrote a web crawler in C# a couple of years ago and have been fiddling with it ever since. During that time, I’ve been forcibly introduced to the following list of things my crawler hates.

  1. Websites that return a 200 OK for everything, even if it was a 404 or a 500 or a 302 or whatever
  2. Websites that don’t use canonical URL tags
  3. Websites with self-replicating URL rabbit holes
  4. Websites that don’t use the Google Sitemap protocol (no, I don’t depend on it, but it’s awfully handy to seed the crawler with starting points – I promise that a crawl will be better with one than without one)
  5. Websites that have non-critical information carried into the page on querystring params, thus giving multiple URLs to the same content
  6. Websites with SSL that don’t control their schemes – serving secured pages only under HTTPS, and vice versa – so that you can’t end up with two URLs for the same content that differ only by scheme
  7. Websites with a “print” option on every single page with a querystring param, thus giving that page two different URLs (okay, okay, this one is easy to filter for – I just always forget…)
  8. Misuse of the content-type HTTP header, because file extensions will handle it all…
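Items 5 through 7 all come down to multiple URLs pointing at the same content, and the usual workaround is URL normalization before deduplicating the crawl frontier. A sketch, assuming illustrative parameter names (`print`, `utm_*`) – real sites need their own noise list:

```python
# Sketch of crawler-side URL normalization: force one scheme and strip
# querystring parameters that don't identify distinct content. The set of
# "noise" parameter names below is an example, not a standard.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NOISE_PARAMS = {"print", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url, canonical_scheme="https"):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Collapse http/https variants of the same page into one URL.
    if scheme in ("http", "https"):
        scheme = canonical_scheme
    # Drop non-identifying params, then sort the rest so that parameter
    # ordering can't produce two URLs for the same content.
    params = sorted(
        (k, v) for k, v in parse_qsl(query) if k.lower() not in NOISE_PARAMS
    )
    # Hostnames are case-insensitive; fragments never reach the server.
    return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))
```

None of this replaces a proper canonical URL tag on the site itself; it just keeps one sloppy site from flooding the crawl queue with duplicates.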

Admittedly, a lot of things in this list are why crawlers are hard to write, and I should just suck it up and deal with it because this is reality. But the entire process has underscored to me how loosely we treat URLs (see the canonical URL post linked above for more on this).

We’re generally very cavalier about our URLs, and I think the web as a whole is worse off for it. URLs are a core technology, and there’s a philosophical point behind them dealing with universal access to information, findability, and indexability.

We should be more careful. Rant over.


Is Fareed Zakaria editing his own Wikipedia page?

November 11, 2014

Fareed Zakaria is Apparently Editing His Own Wikipedia to Remove Plagiarism Allegations:  CNN contributor Fareed Zakaria has been accused of plagiarism.

Our Bad Media has noted several edits to his Wikipedia page which they suspect are coming from Zakaria himself. The edits are coming from New York City where Zakaria lives, they remove a lot of the plagiarism accusations, and they do a couple other things which are curious:

The account’s second edit, made the same day as the first, strengthened his bio by noting he was not just an author, but an author of THREE BOOKS. […] Finally – and most tellingly – the editor did what only a good son would: fix the name of Zakaria’s mother, from “Fatima” to “Fatma.”

Is this proof that Zakaria is editing his own Wikipedia page?  Not conclusive, certainly, but it sure is interesting.


Startup Depression

November 11, 2014

Startup Without Depression: A site dedicated to combatting depression in the startup world.

Depression in the startup community can be an unfortunate byproduct of the stresses of creating something from nothing. For each individual that finds the strength to speak or write publicly of their struggles, many more grapple silently with their own demons. Below is a small collection of resources that offer professional help for those battling depression and related illnesses, as well as a sampling of writing by individuals in tech willing to share their struggles.


Do Hyperlinks Change the Meaning of Content?

November 7, 2014

I’ve been thinking deeply about the idea of hypertext lately (reading Vannevar Bush didn’t help), and I’m curious if there’s a standard, convention, or best practice for the actual selection of words to link in a sentence? Additionally, to what extent does the existence of a link and the placement of that link affect the perceived meaning of the underlying text?

Historically, we’ve all hyperlinked the infamous “click here” phrase and accepted that this doesn’t make sense without the link. But is this effect even more subtle?

Consider, in fact, the hyperlink in the parenthetical aside from the first sentence of this post. There are four ways, I think, to link this:

I think each one of those changes the sentence, subtly — the existence of the link and its positioning has an actual effect on how the sentence is perceived.

Is the important point of this sentence that…

  1. I read something (as opposed to doing something else with it)
  2. I read Vannevar Bush in particular (as opposed to reading someone else)
  3. It “didn’t help” (as opposed to having some other effect — the “didn’t help” is sarcastic)
  4. The combination of all three

So, the link itself becomes part of the content. Whether it wants to or not, where the link is situated changes the meaning of the words.

Does the hyperlink change the emphasis of the sentence, if you were to read it out loud?  Would you mentally incorporate the hyperlink into your verbal presentation of the sentence?

(After I posted this, Arild Henrichsen made a tweet referencing Chandler Bing from Friends and his tendency to emphasize the word “be.” Funny as this is, the point is valid — Chandler aptly demonstrates how you might mentally read a sentence where the word “be” is hyperlinked).

More importantly, if the link were gone, would the sentence even make sense on its own? That sentence depends on its link target to impart meaning. If there were nothing to click on, the sentence would be some random non sequitur with no context (unless, of course, you had read the Vannevar Bush post relatively recently, and were independently able to connect the two). With the link, the reader can click through and understand exactly what I’m talking about.

But even if they never follow the link, the fact that it’s there makes them think there’s some explanation to a sentence which is otherwise random – they are aware that this requires explanation. They can choose to seek out this explanation if they want, or else they can just acknowledge that there is an explanation and decide that they don’t care. But the hyperlink signals that further information about a given word or phrase exists, which is helpful – if someone is making an inside joke and you know this, it’s much less confusing.

Links provide context. Their existence and positioning impart and affect meaning.


“As We May Think”

November 7, 2014

I’ve become quite interested in Internet history lately, and I’ve run across Vannevar Bush’s name multiple times. He was an American scientist, quite active during World War II, and is historically known for expounding on an idea he had for a device called the “memex,” which was, in some ways, a precursor to the web itself. (Tim Berners-Lee, in fact, has cited Bush’s work as foundational to his own.)

Bush was vexed by the difficulty, in the 1940s, of recording knowledge and – more importantly – recalling it. The idea of massive bound volumes frustrated him, because he was convinced that the human mind just didn’t work that way. He expounded on this in a famous 1945 essay published in The Atlantic entitled “As We May Think”:

Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

Linear storage was a problem, not a solution. Bush wanted to store information the way the human mind worked:

The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain.

To this end, he elaborated on his idea of the memex:

A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. [...] It consists of a desk, and while it can presumably be operated from a distance, it is primarily the piece of furniture at which he works. [...] All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another.
That last bit is essentially the basis of hypertext.

The entire essay is worth reading. It’s so celebrated in Internet history, in fact, that a symposium was held in 1995 in honor of its 50th anniversary. In 2005, the 60th anniversary was celebrated with a panel discussion at the ACM Hypertext and Hypermedia conference (video here).

A lot of the first half is Bush discussing various photographic technologies and their possibilities for recording knowledge (he gets very close to inventing Google Glass at one point).  The bit about human thought processes and the memex comes at the very end.



Racism on Reddit

October 28, 2014

Hate Speech Is Drowning Reddit and No One Can Stop It: I was vaguely aware of this, but I don’t frequent many of the subs where this comes to light.

Reddit has a hate speech problem, but more than that, Reddit has a Reddit problem. A persistent, organized and particularly hateful strain of racism has emerged on the site. Enabled by Reddit’s system and permitted thanks to its fervent stance against any censorship, it has proven capable of overwhelming the site’s volunteer moderators and rendering entire subreddits unusable.

More and more, I think Reddit’s best days are behind it. The site has seemingly devolved into one big inside joke.  Stephen Colbert said much the same thing on the first episode of Slate’s new podcast, Working:

I read Reddit in the morning [pause] …which is not as useful as it used to be. I used to feel that it was more stories and less memes, photographic memes. Now it’s just been sort of consumed by Imgur photographic memes.


Facebook Increasingly Owns the News

October 27, 2014

How Facebook is changing the way its users consume news: Wow. I admit to always looking at the “Trending” column on the right, but I never knew it was this pervasive.

About 30 percent of adults in the United States get their news on Facebook, according to a study from the Pew Research Center. The fortunes of a news site, in short, can rise or fall depending on how it performs in Facebook’s News Feed.

This is, of course, the ultimate manifestation of Eli Pariser’s Filter Bubble, through which we continue to hear the things we want to hear.



Twine

October 23, 2014

Want to write the new Zork or a Choose Your Own Adventure book?  You need Twine.

Twine is an open-source tool for telling interactive, nonlinear stories. You don’t need to write any code to create a simple story with Twine, but you can extend your stories with variables, conditional logic, images, CSS, and JavaScript when you’re ready. Twine publishes directly to HTML, so you can post your work nearly anywhere. Anything you create with it is completely free to use any way you like, including for commercial purposes.

I love that this exists. There’s still a subculture that writes this stuff, and they have awards (check out the links at the bottom of that article).