Automatic transformation of hyphens and dashes

The following has been moved from en:Talk:American and British English Differences:

Until I came to Wikipedia I had never come across the use of two successive hyphens to mean a dash. Is this an Americanism, or an internet thing, or a Wikipedianism, or what? GrahamN 01:48 Mar 15, 2003 (UTC)

It's none of the above. It's a universal English convention for signifying an em dash (a long dash, as opposed to an en dash). Arthur 02:00 Mar 15, 2003 (UTC)
Sorry, but that's just not true. It's a non-universal and (in many people's view) semi-illiterate way of signifying an em dash—which looks like that thing there. To type an em dash, use "& # 8212 ;" (without the spaces).
With a little luck, we might be able to persuade the 'pedia software developers to provide us with a way of using the em dash more conveniently—perhaps by making Wiki markup do an automatic mapping of " -- " to —. Tannin 02:08 Mar 15, 2003 (UTC)
semi-illiterate is a bit harsh -- I like to think of myself as at least semi-literate. Most of us just use two hyphens because there is no ASCII em dash and we don't know or can't remember the relatively new # 8212 code. Nothing to do with American/British. It's quicker to type two hyphens too. -- Derek Ross 02:40 Mar 15, 2003 (UTC)

Hey, I was respectfully polite about it. I carefully added semi- in front of illiterate. :)
But all jokes aside, it is a very ugly way to have the pages look. I prefer the <space - space> method, but I freely admit that that is only marginally better. HTML's lack of a proper em dash is a real problem. Whether we use <space - space> or <space - - space>, or even the current mish-mash of both, it matters little. The real answer is to use proper em dashes. (Or, at a pinch, I'd settle for an en dash.)
Tarquin has proposed that the wiki software map all instances of " -- " to "—", just as it maps '' to italics. This strikes me as a really excellent idea. Simple, easy to do, produces a professional-looking result. (The real professionals use a microspace on either side of the em dash—look at any printed book for examples—but we can't do that in HTML. There are microspace characters defined in the HTML standard, but they are impractical to use because of cross-browser implementation issues.) Tannin
Returning to the question of UK/US differences, I've lived in the UK all my life and until I came upon the Wikipedia I'd never seen this usage, not even by hemi-demi-semi-literate people. Nor had I ever heard of an em-dash or an en-dash. As far as I remember we were always quite content with hyphens (typed-like-this), and dashes (typed - like - this). GrahamN 03:21 Mar 15, 2003 (UTC)

Semi-illiterate is extravagantly harsh. (And, GrahamN, although I hope you take no offense, all typesetters HAVE heard of em dashes. This is your educational failing! Someone cheated you!) Semi-illiterate is also wrong! All the English-speaking authorities in the world recognize this. Look at the international Microsoft Word's interpretation, for example, or that of the Gutenberg texts. They all agree with me because modern computerese codes like "# 8212" are not decisive and do not indicate the usage of ASCII. The double dash has been a universal signification for several hundred years (long, obviously, before computer ASCII codes)! (I'm not illiterate. I have a Ph.D. from a major US university, UCSD--the same place the emperor of Japan studied oceanography and the same place I learned more about em dashes than other people.) Arthur 03:28 Mar 15, 2003 (UTC)
Hear, hear. Certainly, "--" was what I was taught in typing class, along with the two-spaces-after-a-period thing that causes so many problems. But actually, the One True Way to do this is three ASCII hyphens for an em dash, and two for an en dash, as Don Knuth has taught us. But I agree with Tannin that -- should be mapped to an em dash. Not enough TeXers out there to make --- work.
Graham, if you are interested, you'll find excellent short articles of relevance here and here. Also much else of interest if you are into broader issues of web design. One of my favourite sites. Tannin
Thanks for the links, Tannin! I edited them for you since the URLs have changed slightly. Anyhoo... Beautypersoni 19:31, 21 May 2010 (UTC)
PS No professionally printed publication uses "--" or "-". The only reason they existed at all is because of the technical limitations of mechanical typewriters. Tannin
Which is a perfectly good reason, of course. -- Derek

Curiously, I never came across the -- until I began to teach American students either. Nobody I know who isn't American uses it. Indeed one professor proverbially hit the roof when he got a nine page essay that had 66 of them (he counted them all in fury!) and told his class full of Americans that he would strangle the next American student who handed in an essay with "those bloody things". "They are barred. BARRED. Get the message. Barred" he bellowed at one student who tried to explain how she had been told in her American college that they were supposed to use them. And my (American) editor hates them with a vengeance and almost has an orgasm at the pleasure of striking them out, calling them "those fucking horrible turds". I can safely say, having just completed 500,000 words in two books that there is not a single solitary one in either text, thus denying Maroleen the orgasmic pleasure of striking them out. (Maybe I'll put in a few on the last page after all; I'll probably be able to hear her screams from NY all the way to Dublin!!!) STÓD/ÉÍRE 04:14 Mar 15, 2003 (UTC)

I'm not American and I've used double hyphens to represent dashes for at least twenty years. I never realised that anyone would be upset by it though. I just started using it for the typewriter/ASCII limitation reasons. It makes sense when there's no alternative. I like the idea of automapping them to dashes though. Real dashes would be nice without having to type character codes. They should be surrounded by half-spaces if possible though. -- Derek Ross 07:21 16 Mar 2003 (UTC)

Of course no publication uses "--". That was precisely the point! As we said, they "signify" em dashes and so are changed by typesetters into em dashes. Also, do I need to remind people that typewriters still exist, and people still represent em dashes on them? Scientific journals still require that of authors submitting articles on paper or as ASCII files. That way the typesetters cannot be confused.
And how do you suppose em dashes are represented by English authors with typewriters? On the second page of Fellowship of the Ring (I figure pretty much all of us will have that), the Gaffer thinks that Bilbo is "well-spoken" (hyphen). Then he says of some that, "They fool about with boats on that big river—and that isn't natural" (em dash). Tolkien did not have a computer. And Tolkien was apparently not satisfied using just hyphens. And the code for "—" was not yet invented. How did he work this magic with only paper, carbon paper, and a typewriter? How did his typesetter know to make a long dash?
I'll tell you how: Tolkien typed (or wrote with a pen) "They fool about with boats on that big river--and that isn't natural.", and then put his manuscript into an envelope with postage, and then mailed the precious thing to his publisher. Tolkien wrote "--" because "--" signifies an em dash. (I suppose the screeching professor would have screamed as usual, but how nice she wasn't at Oxford.) If you can type an em dash, do so. It's easy with modern computers. If you cannot type an em dash, then signify one with "--", because that's how em dashes are signified (or maybe "---". One person above made me doubt myself).
Oh. Maybe I'm being unclear. Please correct all instances of "--". Change them to the code for "—", for that's what they should be on Wikipedia. They only "signify" em dashes (i.e., they are a sign that an em dash should be substituted). Oh. And Wikipedia prefers the code &mdash; to the code &#8212;. Arthur 21:11 Mar 15, 2003 (UTC)
In TeX, and probably other typesetting tools, an en-dash is represented by --, and an em-dash by ---. A hyphen is represented by -. 17:31, 7 Feb 2004 (UTC)
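For illustration, the TeX convention described above could be sketched in Python roughly as follows. This is only a minimal sketch: the function name tex_dashes is made up for this example, and runs of four or more hyphens are not treated specially.

```python
EM_DASH = "\u2014"  # the long dash
EN_DASH = "\u2013"  # the shorter dash


def tex_dashes(text: str) -> str:
    """Map '---' to an em dash and '--' to an en dash; leave '-' alone."""
    # Replace the longer run first, otherwise '---' would be consumed
    # as '--' followed by a stray hyphen.
    text = text.replace("---", EM_DASH)
    text = text.replace("--", EN_DASH)
    return text


if __name__ == "__main__":
    print(tex_dashes("pages 10--20, a well-spoken hobbit, river---and"))
```

The order of the two replacements is the only subtle point: swapping them would turn every triple hyphen into an en dash plus a hyphen.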

That may be for the best; obviously it wouldn't be appropriate for code samples (for(i=100; i>0; i—)?), but this'll be a minority usage, and can be segregated in <nowiki> or <pre> blocks. I'll do some spotchecks... —Brion VIBBER 03:15 16 Mar 2003 (UTC)

There appear to be over 6000 pages in article space on that include double dashes (cur_text rlike '[^-]--[^-]'). Obviously I'm not going to check them all. ;) Some have spaces at the ends, some don't. All that I checked (a dozen or so) were legitimate targets for replacement with em dashes, though I'm sure there are counterexamples somewhere. A couple more thoughts: in monospaced text (<pre>, <tt>, and space-at-start-of-line preformatted text) a — will often come out with a single charcell width (ex: ). This, I think, defeats the purpose; so any automatic replacement should probably explicitly exclude such regions. (<pre> should already be excluded from any wiki markup, but spaced text and <tt>s are not.) It's also worth checking how well the numerically-referenced Unicode character — (&#8212;) is supported, and whether the named reference — (&mdash;) is any better supported. Lynx? Opera? Netscape (shudder) 4.x? Konqueror? --Brion VIBBER
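A rough Python sketch of the replacement with the exclusions suggested above might look like this. It is not the actual wiki software: the function name convert_dashes and the exact set of protected tags are assumptions for the example, and the pattern uses lookarounds rather than the [^-]--[^-] form of the query so that a match does not swallow the neighbouring characters.

```python
import re

EM_DASH = "\u2014"

# First alternative: regions to leave untouched (assumed here to be
# <pre>, <tt> and <nowiki> blocks, plus lines starting with a space).
# Second alternative: exactly two hyphens, not part of a longer run.
TOKEN = re.compile(
    r"(<(pre|tt|nowiki)>.*?</\2>|^ [^\n]*)|(?<!-)--(?!-)",
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)


def convert_dashes(text: str) -> str:
    """Replace double hyphens with an em dash outside protected regions."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            return match.group(0)  # protected region: copy through unchanged
        return EM_DASH

    return TOKEN.sub(replace, text)


if __name__ == "__main__":
    sample = "river--and that\n for(i=100; i>0; i--)\n<tt>x--</tt> is code -- really"
    print(convert_dashes(sample))
```

Because the scan runs left to right, a protected region is consumed whole before any double hyphen inside it can be matched on its own.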
Netscape 4.07 doesn't understand &mdash;, but renders &#8212; correctly (if there's a suitable font available). --Zundark 08:48 16 Mar 2003 (UTC)
Netscape 4.7 (shudder) doesn't understand it either. My earlier assertion that &mdash; was the preferred form was merely because it's recommended on Wikipedia:How to edit a page. It does seem, though, that numerical form will be more universally understood. Arthur

Hopefully, requiring a space each side " -- " will mean we won't catch programming language examples which shouldn't be converted. It's a pity about Netscape 4.07 -- isn't an entity like &mdash; preferable to &#8212; in the HTML spec? -- Tarquin 15:47 16 Mar 2003 (UTC)

I don't see anything in the HTML spec preferring one over the other. Both are valid. I think that character entity references are provided because they are easier to remember, but that isn't a consideration when we're talking about automatically generated code that even editors won't see. --Zundark 16:22 16 Mar 2003 (UTC)
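As a small check of the point that both forms name the same character, here is a sketch using Python's standard html module. The helper numeric_reference is hypothetical, written only to show where the number 8212 comes from.

```python
import html

EM_DASH = "\u2014"

# Both references decode to the same single character.
named = html.unescape("&mdash;")
numeric = html.unescape("&#8212;")


def numeric_reference(ch: str) -> str:
    """Build the decimal character reference for one character."""
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    print(named == numeric == EM_DASH)
    print(numeric_reference(EM_DASH))
```

Generated output can use the numeric form throughout, since nobody has to remember or type it by hand.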
Surrounding spaces are unconventional though. The New York Times has no spaces, for example. And the Tolkien line has no spaces. Arthur
Indeed, that would mean we won't catch several thousand uses intended to be em dashes for the sake of probably fewer than half a dozen programming examples. --Brion VIBBER
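The trade-off can be seen on three lines taken from this page. The following Python sketch is only illustrative; the pattern names and the count_hits helper are invented for the comparison.

```python
import re

SPACED_ONLY = re.compile(r" -- ")         # requires a space on each side
ANY_DOUBLE = re.compile(r"(?<!-)--(?!-)")  # any isolated double hyphen

SAMPLES = {
    "spaced prose": "semi-illiterate is a bit harsh -- I like to think",
    "unspaced prose": "boats on that big river--and that isn't natural",
    "code": "for(i=100; i>0; i--)",
}


def count_hits(pattern: re.Pattern, text: str) -> int:
    """Count how many times the pattern occurs in the text."""
    return len(pattern.findall(text))


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        spaced = count_hits(SPACED_ONLY, text)
        loose = count_hits(ANY_DOUBLE, text)
        print(f"{label}: spaced-only={spaced}, any={loose}")
```

The spaced-only rule leaves the code sample alone but also misses the unspaced Tolkien line, while the looser rule catches both, which is why it needs explicit exclusions for code.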
I seem to learn something every day. Tannin asked me to look at any printed book to see examples of "microspaces" (or half spaces, as Derek Ross called them). I did. The first book was Elizabeth Zimmermann's Knitting Almanac (1974). To my great surprise, for I've read the whole book, there are (previously invisible to me) microspaces all over the thing. (Thanks Tannin.) The second book (Orson Scott Card's Shadow of the Hegemon, 2000) had not one--even though he's very fond of em dashes. I wonder if microspaces are on their way out? In any case, Scott's New York Times Bestselling Novel suggests that we're safe to ignore them. (Tannin gives the computer-world reasons, above.) Arthur 23:32 16 Mar 2003 (UTC)
The use of spaces with dashes is a difference between British and American English. In British English, there is a space on either side of a dash (either en or em: in fact, there is no difference between the two grammatically; the shorter one seems much nicer to me!) – in American, there is either no space, or a half-space. There is, obviously, never a space when using a hyphen. Also, when a comma follows a dash, the dash has a space only on its left side. For example: He was among us all along – O! how blinded we were –, but never did we recognise him. 17:31, 7 Feb 2004 (UTC)

As an aside, why are these pages rendered differently from plain wiki? On The Browser Whose Name Shall Not Be Uttered, wide spaces occur at the location of all markup. In the second paragraph here, for example, there are wide blank spaces surrounding both "em dash" and "en dash". (And for some reason, this page cannot be edited by IE 5.5, although I use that to edit all other pages.)

It's better if you do use names, as there are many browsers whose names ought not to be uttered and I'm not sure which you mean. ;) Rendering shouldn't be different at all, save a tiny difference in the stylesheet (gray background). Editing in IE5.5 works fine for me so long as I don't select 'edit box has full width.' --Brion VIBBER 22:13 16 Mar 2003 (UTC)
Okay! Unselecting 'edit box has full width' fixed both of my problems. IE 5.5 can edit now, and (shudder) Netscape 4.75 sees no odd spaces around markup. I failed to understand that I'm a different user here, with different preferences. Fixed!