Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Which characters are allowed in hashtags

I am having a really hard time figuring out a regular expression (in C#) to validate hashtags. \w simply isn't enough, as special characters are missing (non-ASCII Latin letters for starters, but also a lot of other foreign characters).

I need to support ALL hashtags there are. Mainly from Twitter, but in the future also from other providers.

My best shot (so far) is: ^#[a-zA-Z_0-9\u00C0-\u02AF]+$ (C# regex)

I cannot find any decent documentation from Twitter or anyone else about this, so:

  • Does anyone know of any documentation I have missed?
  • OR does anyone know which Unicode ranges I should include as valid characters for hashtags?
  • AND Can anybody tell me if there is a difference between the support of hashtags on e.g. Twitter, Instagram, Facebook, etc.?

Update: I should note that C# is not the only language I need this in, hence the need for a precise specification.

1 Reply

A Quick-and-Dirty Simplified Approach

Here is a nice read from the Twitter engineering team:

To be fair, the Twitter team do have a standard. Even if they don't use it themselves.

The test cases and other valuable information are located at https://github.com/twitter/twitter-text/blob/master/java/src/test/java/com/twitter/twittertext/RegexTest.java. According to it, a valid hashtag can be matched in C# with

(^|\s)([#\uFF03][\w\u05be\u05f3\u05f4]*[\p{L}_]+[\w\u05be\u05f3\u05f4]*)

See this regex demo

Since you want to be able to use this in any language, just note that \p{L} is equal to

[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0-\u08B4\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FD5\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AD\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]

and \w is roughly a combination of \p{L}, _ and \p{N}; \p{N} expands to:

[0-9\xB2\xB3\xB9\xBC-\xBE\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u09F4-\u09F9\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0B72-\u0B77\u0BE6-\u0BF2\u0C66-\u0C6F\u0C78-\u0C7E\u0CE6-\u0CEF\u0D66-\u0D75\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F33\u1040-\u1049\u1090-\u1099\u1369-\u137C\u16EE-\u16F0\u17E0-\u17E9\u17F0-\u17F9\u1810-\u1819\u1946-\u194F\u19D0-\u19DA\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\u2070\u2074-\u2079\u2080-\u2089\u2150-\u2182\u2185-\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3007\u3021-\u3029\u3038-\u303A\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA620-\uA629\uA6E6-\uA6EF\uA830-\uA835\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]

and whitespace is something like

[\f\n\r\t\v\x20\xA0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

Note there can be issues with diacritic matching in ES5 regex syntax.
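In engines that support Unicode property escapes there is no need to paste the expanded classes at all. As a rough sketch, assuming a JavaScript runtime with the ES2018 `u` flag, the simplified pattern above could be ported like this. JavaScript's `\w` stays essentially ASCII-only, so it is spelled out as `[\p{L}\p{N}_]` here; `WORD`, `SIMPLE_HASHTAG` and `findSimpleHashtag` are names made up for this illustration.

```javascript
// Assumed port of the simplified pattern; requires the ES2018 `u` flag.
// \w is essentially ASCII-only in JavaScript, so [\p{L}\p{N}_] is used instead.
const WORD = String.raw`\p{L}\p{N}_\u05be\u05f3\u05f4`;
const SIMPLE_HASHTAG = new RegExp(
  String.raw`(^|\s)([#\uFF03][${WORD}]*[\p{L}_]+[${WORD}]*)`,
  "u"
);

// Hypothetical helper: returns the first hashtag (including its # sign) or null.
function findSimpleHashtag(text) {
  const m = SIMPLE_HASHTAG.exec(text);
  return m ? m[2] : null;
}

console.log(findSimpleHashtag("#caf\u00e9"));   // "#café"
console.log(findSimpleHashtag("tag: #日本語")); // "#日本語"
console.log(findSimpleHashtag("#123"));         // null (at least one letter or _ is required)
```

On an ES5-only engine the `u` flag is not available, which is where the long expanded classes above come in.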

UPDATE

twitter-text C# Adaptation

The Java library features the following regex for the hashtags:

VALID_HASHTAG = Pattern.compile("(^|[^&" + HASHTAG_LETTERS_NUMERALS + "])(#|\uFF03)(?!\uFE0F|\u20E3)(" + HASHTAG_LETTERS_NUMERALS_SET + "*" + HASHTAG_LETTERS_SET + HASHTAG_LETTERS_NUMERALS_SET + "*)", Pattern.CASE_INSENSITIVE);

Translating into C#:

string HASHTAG_LETTERS = @"\p{L}\p{M}";
string HASHTAG_NUMERALS = @"\p{Nd}";
string HASHTAG_SPECIAL_CHARS = @"_\u200c\u200d\ua67e\u05be\u05f3\u05f4\uff5e\u301c\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7";
string HASHTAG_LETTERS_NUMERALS = HASHTAG_LETTERS + HASHTAG_NUMERALS + HASHTAG_SPECIAL_CHARS;
string HASHTAG_LETTERS_NUMERALS_SET = "[" + HASHTAG_LETTERS_NUMERALS + "]";
string HASHTAG_LETTERS_SET = "[" + HASHTAG_LETTERS + "]";
Regex VALID_HASHTAG = new Regex("(^|[^&" + HASHTAG_LETTERS_NUMERALS + @"])(#|\uFF03)(?!\uFE0F|\u20E3)(" + HASHTAG_LETTERS_NUMERALS_SET + "*" + HASHTAG_LETTERS_SET + HASHTAG_LETTERS_NUMERALS_SET + "*)", RegexOptions.IgnoreCase);

And here is a testing C# demo:

using System;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        string HASHTAG_LETTERS = @"\p{L}\p{M}";
        string HASHTAG_NUMERALS = @"\p{Nd}";
        string HASHTAG_SPECIAL_CHARS = @"_\u200c\u200d\ua67e\u05be\u05f3\u05f4\uff5e\u301c\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7";
        string HASHTAG_LETTERS_NUMERALS = HASHTAG_LETTERS + HASHTAG_NUMERALS + HASHTAG_SPECIAL_CHARS;
        string HASHTAG_LETTERS_NUMERALS_SET = "[" + HASHTAG_LETTERS_NUMERALS + "]";
        string HASHTAG_LETTERS_SET = "[" + HASHTAG_LETTERS + "]";
        Regex VALID_HASHTAG = new Regex("(^|[^&" + HASHTAG_LETTERS_NUMERALS + @"])(#|\uFF03)(?!\uFE0F|\u20E3)(" + HASHTAG_LETTERS_NUMERALS_SET + "*" + HASHTAG_LETTERS_SET + HASHTAG_LETTERS_NUMERALS_SET + "*)", RegexOptions.IgnoreCase);
        
        List<string> tests = new List<string>() {"#hashtag",
            "#Az?rbaycanca",
            "#m?∥ae",
            "#?e?tina",
            "#?aoi?ín",
            "#Caoi?ín",
            "#ta?im",
            "#hag?ua",
            "#café",
            "#?????",
            "#??????",
            "#?????????",
            "#????",
            "#???",
            "#???????",
            "#??????",
            "#??????",
            "#?????????",
            "#???",
            "#日本語ハッシュタグ",
            "#日本語ハッシュタグ",
            "これはOK #ハッシュタグ",
            "これもOK。#ハッシュタグ",
            "これはダメ#ハッシュタグ",
            "#1",
            "#2"};
        tests.ForEach(input => // JUST A PIECE OF DEMO CODE
            Console.WriteLine("Input: " + input + " = " + VALID_HASHTAG.IsMatch(input) +
              (VALID_HASHTAG.IsMatch(input) ? ", match = " + VALID_HASHTAG.Match(input).Value : "")));
    }
}
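
Note that Match.Value in the demo also includes the character captured before the # sign (the first group); the tag text itself sits in the third capture group. Since the question asks about other languages too, here is a rough sketch of the same pattern in JavaScript, assuming an engine with the ES2018 `u` flag and `String.prototype.matchAll`. The helper name `extractHashtags` is hypothetical and is not the twitter-text function; it returns only the third group.

```javascript
// Assumed JavaScript port of the C# pattern above; needs the ES2018 `u` flag.
const LETTERS = String.raw`\p{L}\p{M}`;
const NUMERALS = String.raw`\p{Nd}`;
const SPECIAL = String.raw`_\u200c\u200d\ua67e\u05be\u05f3\u05f4\uff5e\u301c\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7`;
const LETTERS_NUMERALS = LETTERS + NUMERALS + SPECIAL;

const VALID_HASHTAG = new RegExp(
  `(^|[^&${LETTERS_NUMERALS}])(#|\\uFF03)(?!\\uFE0F|\\u20E3)` +
    `([${LETTERS_NUMERALS}]*[${LETTERS}][${LETTERS_NUMERALS}]*)`,
  "giu"
);

// Hypothetical helper: returns the tag texts (group 3, without the # sign).
function extractHashtags(text) {
  return Array.from(text.matchAll(VALID_HASHTAG), (m) => m[3]);
}

// Logs the two valid tags (ハッシュタグ and café); "#1" has no letter and is skipped.
console.log(extractHashtags("これもOK。#ハッシュタグ and #caf\u00e9, but not #1"));
```

Reading group 3 rather than the whole match avoids having to trim the leading separator afterwards.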

JavaScript Hashtag validation

If you use the twitter-text JavaScript library, identifying hashtags can be done with a mere:

var isValidHashtag = twttr.txt.isValidHashtag(hashtag);
